Bug 1383559 - [Ganesha] : rm -rf * hangs during continuous I/O.
Summary: [Ganesha] : rm -rf * hangs during continuous I/O.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: nfs-ganesha
Version: rhgs-3.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.5.0
Assignee: Frank Filz
QA Contact: Manisha Saini
URL:
Whiteboard:
Depends On:
Blocks: 1696807
 
Reported: 2016-10-11 04:06 UTC by Ambarish
Modified: 2019-10-30 12:16 UTC
CC: 15 users

Fixed In Version: nfs-ganesha-2.7.3-3
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-30 12:15:39 UTC
Embargoed:




Links
Red Hat Product Errata RHEA-2019:3252 - 2019-10-30 12:16:15 UTC

Description Ambarish 2016-10-11 04:06:13 UTC
Description of problem:
-----------------------

Setup consists of a 4-node cluster with the volume mounted on 6 clients - 4 via VIP and 2 via physical IP.
rm -rf * from one of my clients hung (with continuous writes from the other clients) for almost 10 hours after beginning execution. A normal keyboard interrupt did not help me break out of it.

I/O wasn't affected, though.

pcs status shows OK:

[root@gqas013 ~]# pcs status
Cluster name: G1474623742.03
Last updated: Mon Oct 10 23:49:36 2016		Last change: Mon Oct 10 13:14:53 2016 by root via cibadmin on gqas006.sbu.lab.eng.bos.redhat.com
Stack: corosync
Current DC: gqas006.sbu.lab.eng.bos.redhat.com (version 1.1.13-10.el7-44eb2dd) - partition with quorum
4 nodes and 16 resources configured

Online: [ gqas005.sbu.lab.eng.bos.redhat.com gqas006.sbu.lab.eng.bos.redhat.com gqas011.sbu.lab.eng.bos.redhat.com gqas013.sbu.lab.eng.bos.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Started: [ gqas005.sbu.lab.eng.bos.redhat.com gqas006.sbu.lab.eng.bos.redhat.com gqas011.sbu.lab.eng.bos.redhat.com gqas013.sbu.lab.eng.bos.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ gqas005.sbu.lab.eng.bos.redhat.com gqas006.sbu.lab.eng.bos.redhat.com gqas011.sbu.lab.eng.bos.redhat.com gqas013.sbu.lab.eng.bos.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ gqas005.sbu.lab.eng.bos.redhat.com gqas006.sbu.lab.eng.bos.redhat.com gqas011.sbu.lab.eng.bos.redhat.com gqas013.sbu.lab.eng.bos.redhat.com ]
 gqas013.sbu.lab.eng.bos.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started gqas013.sbu.lab.eng.bos.redhat.com
 gqas005.sbu.lab.eng.bos.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started gqas005.sbu.lab.eng.bos.redhat.com
 gqas006.sbu.lab.eng.bos.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started gqas006.sbu.lab.eng.bos.redhat.com
 gqas011.sbu.lab.eng.bos.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started gqas011.sbu.lab.eng.bos.redhat.com

PCSD Status:
  gqas013.sbu.lab.eng.bos.redhat.com: Online
  gqas005.sbu.lab.eng.bos.redhat.com: Online
  gqas006.sbu.lab.eng.bos.redhat.com: Online
  gqas011.sbu.lab.eng.bos.redhat.com: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/enabled
  pcsd: active/disabled
[root@gqas013 ~]# 

ganesha, pacemaker, corosync, pcsd and glusterd were active and alive at all times.

I could not take a backtrace (BT) of the hung process, as gdb itself was hanging (because of the hung process).
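(Even when gdb wedges on a D-state process, a kernel-side stack can usually still be captured - a sketch, assuming a single hung rm on the client and root access:)

# Kernel stack of the hung rm process
cat /proc/$(pgrep -x rm)/stack

# Or dump all blocked tasks to the kernel log (requires sysrq enabled)
echo w > /proc/sysrq-trigger
dmesg | tail -n 100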

sosreport, ganesha logs and tcpdump locations are in the comments. Since the issue is pretty consistent, I can work with Dev on whatever else they may need.


Version-Release number of selected component (if applicable):
-------------------------------------------------------------

nfs-ganesha-2.4.0-2.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-2.el7rhgs.x86_64


How reproducible:
----------------

2/2

Steps to Reproduce:
------------------

1. Mount the volume on the client via v4.

2. Run I/O. An hour into the workload, trigger rm -rf * from one of the mounts (a minimal sketch follows).
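A minimal reproduction sketch (the export path, mount point and I/O workload are assumptions; the report does not specify the exact workload):

# On each client: NFSv4 mount via VIP or physical IP
mount -t nfs -o vers=4 192.168.79.152:/testvol /mnt/testvol

# On the other clients: keep continuous writes going, e.g.
dd if=/dev/zero of=/mnt/testvol/file.$(hostname) bs=1M count=100000 &

# On one client, about an hour into the workload:
cd /mnt/testvol && rm -rf *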

Actual results:
---------------

rm hangs and cannot be interrupted from the keyboard via Ctrl+C/X/Z.

Expected results:
-----------------

No hangs.

Additional info:
----------------

* mount vers=4

* Client/Server OS : RHEL 7.2

*Vol Config* :

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: b93b99bd-d1d2-4236-98bc-08311f94e7dc
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
ganesha.enable: on
features.cache-invalidation: off
nfs.disable: on
performance.readdir-ahead: on
performance.stat-prefetch: off
server.allow-insecure: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
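(For reference, reconfigured options like the above are applied through the gluster CLI; a sketch, not the exact commands run here:)

gluster volume set testvol features.cache-invalidation off
gluster volume set testvol ganesha.enable on    # exports the volume via nfs-ganesha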

Comment 3 Ambarish 2016-10-11 04:13:29 UTC
Last few lines from dmesg on the client where rm was hung:

[15787.291198] nfs: server 192.168.79.152 not responding, still trying
[15829.189303] nfs: server 192.168.79.152 OK
[16009.508639] nfs: server 192.168.79.152 not responding, still trying
[16049.900337] nfs: server 192.168.79.152 OK
[16230.190385] nfs: server 192.168.79.152 not responding, still trying
[16266.302110] nfs: server 192.168.79.152 OK
[16446.519894] nfs: server 192.168.79.152 not responding, still trying
[16481.261059] nfs: server 192.168.79.152 OK
[... identical "not responding, still trying" / "OK" pairs repeat roughly every 220 seconds, timestamps 16661.569060 through 40127.927662 ...]
[40308.040787] nfs: server 192.168.79.152 not responding, still trying
[root@gqac015 ~]#
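(The cycle above can be followed live on the client while reproducing - a sketch, assuming the RHEL 7 util-linux dmesg:)

dmesg -wT | grep 'nfs: server'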

Comment 5 Soumya Koduri 2016-10-14 06:57:32 UTC
As discussed, please check the following - 

* Collect a packet trace for some time (from client and server) to check whether there is active I/O going on.

* Collect thread backtraces of the ganesha processes at 2-3 intervals and provide them to us. If possible, also collect a core each time you collect the trace (see the sketch below).
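A sketch of this kind of collection (interface name and output paths are hypothetical):

# Packet trace on client and server (NFS over TCP port 2049)
tcpdump -i eth0 -s 0 -w /tmp/nfs-trace.pcap port 2049

# Thread backtraces of the ganesha process, repeated at 2-3 intervals
gdb -p $(pidof ganesha.nfsd) -batch -ex 'thread apply all bt' > /tmp/ganesha-bt.$(date +%s).txt

# Core dump alongside each backtrace
gcore -o /tmp/ganesha-core $(pidof ganesha.nfsd)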

I suspect "rm -rf *" invokes READDIR, which ends up taking a lot of time, as observed in bug 1382912, if the mount point contains millions of entries.


I am working on optimizing that code path and will provide test patches once they are ready.

Comment 9 surabhi 2016-11-29 11:23:34 UTC
As per the triaging, we all agree that this BZ has to be fixed in rhgs-3.2.0. Providing qa_ack.

Comment 12 Atin Mukherjee 2016-12-06 07:16:08 UTC
Upstream fix:

https://review.gerrithub.io/304278
https://review.gerrithub.io/304279

Comment 15 Kaleb KEITHLEY 2017-10-05 11:30:24 UTC
POST with rebase to nfs-ganesha-2.5.x

Comment 19 Manisha Saini 2018-07-24 14:27:38 UTC
Verified this with

# rpm -qa | grep ganesha
nfs-ganesha-2.5.5-8.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-14.el7rhgs.x86_64
nfs-ganesha-gluster-2.5.5-8.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.5.5-8.el7rhgs.x86_64


Steps performed for verification-

1. Created a Distributed-Replicate 6 x 3 volume.
2. Exported the volume via ganesha.
3. Mounted the volume on 6 clients via v4 - 4 with VIP and 2 with the server's physical IP.
4. Performed I/O from all 6 mount points.
5. While I/O was in progress, after 1 hour, performed rm -rf * from one of the clients.

All files were deleted from the mount point. No hangs were observed.
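(A quick post-check along these lines confirms the deletion actually completed - mount path hypothetical:)

find /mnt/testvol -mindepth 1 | wc -l    # expect 0 once rm -rf returns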

Moving this BZ to verified state.

Comment 20 Daniel Gryniewicz 2018-08-27 12:25:34 UTC
This should be moved out of 3.4, since dirent chunking was removed.

Comment 31 Manisha Saini 2019-08-29 20:41:33 UTC
Verified this with

# rpm -qa | grep ganesha
nfs-ganesha-2.7.3-7.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.7.3-7.el7rhgs.x86_64
glusterfs-ganesha-6.0-11.el7rhgs.x86_64
nfs-ganesha-gluster-2.7.3-7.el7rhgs.x86_64


Steps:
1. Create a 4-node ganesha cluster.
2. Create 1 Distributed-Disperse volume, 2 x (4 + 2) = 12 (see the sketch after this list).
3. Mount the volume on 6 clients via v4.1.
4. Run I/O from 5 clients.
5. Wait for around 1 hour, then run rm -rf * from another client with I/O still ongoing.
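(Step 2 corresponds to a create command along these lines - hostnames and brick paths are hypothetical; force is needed when multiple bricks of a subvolume land on the same host:)

gluster volume create testvol disperse 6 redundancy 2 \
    server{1..4}:/bricks/testvol_brick{0..2} force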

No hangs were observed while I/O was running. Moving this BZ to verified state.

Comment 33 errata-xmlrpc 2019-10-30 12:15:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:3252

