Bug 1476559

Summary: [Stress]: Input/output error while creating files (using touch)/bonnie++/dd during MTSH (multithreaded self-heal).
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Ambarish <asoman>
Component: nfs-ganesha
Assignee: Kaleb KEITHLEY <kkeithle>
Status: CLOSED ERRATA
QA Contact: Manisha Saini <msaini>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.3
CC: amukherj, bturner, dang, ffilz, jthottan, kkeithle, mbenjamin, msaini, rhinduja, rhs-bugs, sheggodu, skoduri, storage-qa-internal
Target Milestone: ---
Target Release: RHGS 3.4.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: nfs-ganesha-2.5.4-1
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-04 06:53:35 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1503134

Description Ambarish 2017-07-30 10:18:21 UTC
Description of problem:
-----------------------

2x2 gluster volume, 4 clients mounted from the same server via NFS v4.0.

Brought down two bricks.

Created lots of small files. Brought all the bricks online and triggered a multithreaded self-heal.

Heal info was also being run periodically in the background.
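
A rough command-level sketch of those steps (volume name and mount path are from this report; the brick PIDs and the file count below are placeholders, adjust as needed):

gluster volume status testvol            # note the PIDs of the two bricks
kill -9 <brick-pid-1> <brick-pid-2>      # take the bricks offline

# create lots of small files from the NFSv4.0 mounts while the bricks are down
for i in $(seq 1 100000); do touch /gluster-mount/file.$i; done

gluster volume start testvol force       # bring the bricks back online
gluster volume heal testvol              # trigger self-heal (shd-max-threads is 64 on this volume)

# poll heal progress in the background
while true; do gluster volume heal testvol info; sleep 60; done &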


Multiple touch and bonnie++ runs failed (on 2 different clients mounted from the same server):

<snip>


[root@gqac008 gluster-mount]# touch a
touch: cannot touch ‘a’: Input/output error

[root@gqac008 gluster-mount]# touch a
touch: cannot touch ‘a’: Input/output error

[root@gqac008 gluster-mount]# touch b
touch: cannot touch ‘b’: Input/output error
[root@gqac008 gluster-mount]# 
[root@gqac008 gluster-mount]# 

[root@gqac008 gluster-mount]# touch c
touch: cannot touch ‘c’: Input/output error

<snip>


AND ..

<snip>

Changing to the specified mountpoint
/gluster-mount/run2220
executing bonnie
Using uid:0, gid:0.
Can't open file ./Bonnie.2247

real    0m1.227s
user    0m0.002s
sys     0m0.001s
bonnie failed
0
Total 0 tests were successful
Switching over to the previous working directory
Removing /gluster-mount/run2220/
[root@gqac008 /]# 
<snip>


I did not find anything in the brick logs (around Sunday, July 30, 3 PM IST).

tcpdumps, logs, etc. will be shared in comments.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------

[root@gqas013 glusterfs]# rpm -qa|grep ganes
nfs-ganesha-2.4.4-16.el7rhgs.x86_64
nfs-ganesha-gluster-2.4.4-16.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-36.el7rhgs.x86_64



How reproducible:
-----------------

1/1

Actual results:
---------------

EIO on mount point.

Expected results:
-----------------

No EIO on mount point.

Additional info:
---------------

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 41c5aa32-ec60-4591-ae6d-f93a0b13b47c
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas008.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
cluster.shd-wait-qlength: 655536
cluster.shd-max-threads: 64
client.event-threads: 4
server.event-threads: 4
cluster.lookup-optimize: on
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
ganesha.enable: on
features.cache-invalidation: on
server.allow-insecure: on
performance.stat-prefetch: off
transport.address-family: inet
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
[root@gqas013 glusterfs]#

Comment 2 Ambarish 2017-07-30 10:20:41 UTC
Proposing as a blocker, since the application was affected.

Comment 4 Ambarish 2017-07-30 10:23:19 UTC
The test case was tried once on a FUSE mount, where it passed.

Comment 7 Ambarish 2017-07-30 12:55:45 UTC
[root@gqac019 gluster-mount]# dd if=/dev/zero of=a count=1 bs=100 conv=fdatasync
dd: failed to open ‘a’: Input/output error
[root@gqac019 gluster-mount]#

Comment 12 Daniel Gryniewicz 2017-08-01 14:11:18 UTC
Updating this:  The reaper thread is running, so it's a livelock, not a deadlock.  There is a state on the so_state_list that is not in the hashtable.  This means state_del_locked() bails early, and the state is not deleted or removed from the so_state_list, causing an infinite loop with cr_mutex held.
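
For illustration, a minimal sketch of that failure pattern (heavily simplified; the structs and the helper below are stand-ins, not the actual nfs-ganesha definitions):

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct state {
    struct state *next;
    bool in_hashtable;
};

struct state_owner {
    pthread_mutex_t cr_mutex;        /* stands in for the client-record mutex */
    struct state *so_state_list;     /* this owner's list of states */
};

/* Stand-in for state_del_locked(): the real function first removes the state
 * from a global hash table; if that lookup fails it bails out early and never
 * unlinks the entry from so_state_list. */
static void state_del_locked(struct state_owner *owner, struct state *s)
{
    if (!s->in_hashtable)
        return;                      /* early bail: the list entry stays put */
    s->in_hashtable = false;
    owner->so_state_list = s->next;  /* unlink from so_state_list */
}

/* Cleanup loop: if the head of so_state_list is a state that is not in the
 * hash table, state_del_locked() keeps bailing, the head never advances, and
 * this thread spins forever while holding cr_mutex.  The reaper and other
 * threads keep running, so it is a livelock rather than a deadlock. */
static void drain_owner_states(struct state_owner *owner)
{
    pthread_mutex_lock(&owner->cr_mutex);
    while (owner->so_state_list != NULL)
        state_del_locked(owner, owner->so_state_list);
    pthread_mutex_unlock(&owner->cr_mutex);
}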

I'm wondering if this will fix it, but I'd like Frank to weigh in:
b049eb90e78670d3e17ffe91b5c4048f8d7520d4

There may be one or two needed on top of that.

Comment 13 Frank Filz 2017-08-01 18:26:52 UTC
There is a set of somewhat related patches; the ones marked with * are NLM (NFSv3) only, but may be required by the other patches:

* feb12d2fe13fcdbd5ae80cbe6575af98c4657520 Fix nlm client refcount going up from zero
* acb632c319b3846200bcd718aa8e637bb0f4e1fd Fix nsm client refcount going up from zero
* 52e0e125322fb0cc5c608be4cd43b90a702d88e2 Fix nlm state refcount going up from zero
  b049eb90e78670d3e17ffe91b5c4048f8d7520d4 Convert state_owner hash table to behave like others
* 51d0f6c77d3e0d95be5ea27abe1f8c66db242884 Fix typo in hash table name in get_nlm_client()
  006575d43d77dcd5c3eefd11e9d508a33e2bf459 Fix hashtable_setlatched overwrite parameter
* 84d5ef4003e13a4078fa01d69a67bfe2ae02c61a Use care_t care instead of bool nsm_state_applies in get_nlm_state
  60e20e2e9b531910c2ef1a20ad4036ff595df66f Fix a race in using hashtables leading to crashes

There may be other relevant patches also.

Comment 14 Ambarish 2017-08-02 05:08:12 UTC
I/O is successful when I try it on the same volume accessed via v3.

Comment 15 Daniel Gryniewicz 2017-08-02 14:03:25 UTC
The livelock is related to the v4 Session ID, so v3 should be unaffected.

Comment 18 Kaleb KEITHLEY 2017-08-16 12:36:02 UTC
strea

Comment 20 Kaleb KEITHLEY 2017-10-05 11:26:31 UTC
POST with rebase to nfs-ganesha-2.5.x

Comment 26 errata-xmlrpc 2018-09-04 06:53:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2610