Bug 1368073

Summary: [client-io-threads]: process not responding for more than 120 sec and is hung
Product: Red Hat Gluster Storage
Reporter: Rahul Hinduja <rhinduja>
Component: disperse
Assignee: Sunil Kumar Acharya <sheggodu>
Status: CLOSED WONTFIX
QA Contact: Nag Pavan Chilakam <nchilaka>
Severity: medium
Docs Contact:
Priority: medium
Version: rhgs-3.1
CC: aspandey, rcyriac, rhinduja, rhs-bugs, sanandpa, storage-qa-internal
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-04-16 18:15:58 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Rahul Hinduja 2016-08-18 09:57:16 UTC
Description of problem:
=======================

While verifying client-io-threads on an EC volume from a Fuse mount, the following was observed in dmesg on the client:

[88440.551165] INFO: task crefi:1972 blocked for more than 120 seconds.
[88440.551223] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[88440.551283] crefi           D ffff8800aed7ccb0     0  1972   1968 0x00000080
[88440.551287]  ffff88003ec57c70 0000000000000082 ffff880036bbb980 ffff88003ec57fd8
[88440.551291]  ffff88003ec57fd8 ffff88003ec57fd8 ffff880036bbb980 ffff8800aed7cca8
[88440.551293]  ffff8800aed7ccac ffff880036bbb980 00000000ffffffff ffff8800aed7ccb0
[88440.551296] Call Trace:
[88440.551305]  [<ffffffff8163b9e9>] schedule_preempt_disabled+0x29/0x70
[88440.551309]  [<ffffffff816396e5>] __mutex_lock_slowpath+0xc5/0x1c0
[88440.551312]  [<ffffffff81638b4f>] mutex_lock+0x1f/0x2f
[88440.551316]  [<ffffffff811eb9af>] do_last+0x28f/0x1270
[88440.551320]  [<ffffffff811c11ce>] ? kmem_cache_alloc_trace+0x1ce/0x1f0
[88440.551323]  [<ffffffff811ee672>] path_openat+0xc2/0x490
[88440.551362]  [<ffffffffa01e7cd4>] ? xfs_iunlock+0xa4/0x130 [xfs]
[88440.551383]  [<ffffffffa01d45fa>] ? xfs_free_eofblocks+0xda/0x270 [xfs]
[88440.551387]  [<ffffffff811efe3b>] do_filp_open+0x4b/0xb0
[88440.551390]  [<ffffffff811fc9c7>] ? __alloc_fd+0xa7/0x130
[88440.551394]  [<ffffffff811dd7e3>] do_sys_open+0xf3/0x1f0
[88440.551397]  [<ffffffff811dd8fe>] SyS_open+0x1e/0x20
[88440.551401]  [<ffffffff81645909>] system_call_fastpath+0x16/0x1b

Scenario:
=========

1. Create an EC volume (3 x (8+3)) on an 11-node cluster (see the command sketch after this list).
2. Set event-threads to 4.
3. Enable client-io-threads.
4. Mount the volume on the client via Fuse.
5. Open 10 sessions on the client (different terminals).
6. Create I/O from 5 sessions using crefi on the same directory. Make sure crefi runs multi-threaded with -T 10.
crefi -b 4 -d 4 -n 20 --multi --random --min 1K --max 100M -T 10 -t text --fop=create /mnt/fuse/multiple_threads/
7. From the other 5 sessions, stat all the contents of the directory every 30 seconds.
for i in {1..100}; do echo "This is iteration $i" ; find * | xargs stat ; sleep 30 ;done
8. While steps 6 and 7 are in progress, bring down 1 brick from each subvolume. Wait for a while, then start the volume with force to bring the bricks back online.
9. Repeat step 8 multiple times (3-4), each time after healing has completed.
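
For reference, a minimal sketch of the gluster commands behind steps 1-4 and 8 is given below. The volume name (ecvol), hostnames (server1..server11), brick paths and mount point are placeholders assumed for illustration, not taken from the original setup; the option names themselves are standard gluster volume options.

# Step 1: 3 x (8+3) disperse volume across an 11-node cluster (placeholder hosts/paths)
gluster volume create ecvol disperse-data 8 redundancy 3 \
    server{1..11}:/bricks/brick1/ecvol \
    server{1..11}:/bricks/brick2/ecvol \
    server{1..11}:/bricks/brick3/ecvol
gluster volume start ecvol

# Step 2: raise the event threads to 4 on both sides
gluster volume set ecvol client.event-threads 4
gluster volume set ecvol server.event-threads 4

# Step 3: enable client io-threads
gluster volume set ecvol performance.client-io-threads on

# Step 4: Fuse mount on the client
mount -t glusterfs server1:/ecvol /mnt/fuse

# Step 8: kill one brick process per subvolume (PIDs from 'gluster volume status ecvol'),
# then bring the bricks back and wait for self-heal before repeating
kill -9 <brick-pid>                 # repeat for one brick in each of the 3 subvolumes
gluster volume start ecvol force
gluster volume heal ecvol info      # wait until no entries remain to be healed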


Version-Release number of selected component (if applicable):
=============================================================

glusterfs-3.7.9-10.el7rhgs.x86_64

How reproducible:
=================

Hit the issue once; tried again but could not reproduce it.


Additional info:
================

While the above use case was in progress, resource consumption (CPU and memory) was collected every 10 seconds.
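
The bug does not record the exact collection method; a simple loop along the following lines (an assumption, not the reporter's script) would produce comparable per-process numbers:

# hypothetical sampler: log %CPU / %MEM of the gluster processes every 10 seconds
while true; do
    date
    top -b -n 1 | grep -E 'glusterfs|glusterfsd'
    sleep 10
done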

On the client, the glusterfs process's max CPU usage was 181% and its min was 0%. For most of the run it stayed in the 50-181% range.

Memory usage throughout was in the range of 2.7-6.7%.

On the primary server, CPU also spiked to 391% for a fraction of a second, and max memory usage was 0.7%.

Currently raising this bz under EC for initial analysis.

Comment 5 Pranith Kumar K 2016-08-31 21:38:09 UTC
Keeping the bug in needinfo until further updates.