Description of problem:
=======================
During verification of client-io-threads on a Fuse mount of an EC volume, observed the following hung-task warning in dmesg on the client:

[88440.551165] INFO: task crefi:1972 blocked for more than 120 seconds.
[88440.551223] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[88440.551283] crefi           D ffff8800aed7ccb0     0  1972   1968 0x00000080
[88440.551287] ffff88003ec57c70 0000000000000082 ffff880036bbb980 ffff88003ec57fd8
[88440.551291] ffff88003ec57fd8 ffff88003ec57fd8 ffff880036bbb980 ffff8800aed7cca8
[88440.551293] ffff8800aed7ccac ffff880036bbb980 00000000ffffffff ffff8800aed7ccb0
[88440.551296] Call Trace:
[88440.551305] [<ffffffff8163b9e9>] schedule_preempt_disabled+0x29/0x70
[88440.551309] [<ffffffff816396e5>] __mutex_lock_slowpath+0xc5/0x1c0
[88440.551312] [<ffffffff81638b4f>] mutex_lock+0x1f/0x2f
[88440.551316] [<ffffffff811eb9af>] do_last+0x28f/0x1270
[88440.551320] [<ffffffff811c11ce>] ? kmem_cache_alloc_trace+0x1ce/0x1f0
[88440.551323] [<ffffffff811ee672>] path_openat+0xc2/0x490
[88440.551362] [<ffffffffa01e7cd4>] ? xfs_iunlock+0xa4/0x130 [xfs]
[88440.551383] [<ffffffffa01d45fa>] ? xfs_free_eofblocks+0xda/0x270 [xfs]
[88440.551387] [<ffffffff811efe3b>] do_filp_open+0x4b/0xb0
[88440.551390] [<ffffffff811fc9c7>] ? __alloc_fd+0xa7/0x130
[88440.551394] [<ffffffff811dd7e3>] do_sys_open+0xf3/0x1f0
[88440.551397] [<ffffffff811dd8fe>] SyS_open+0x1e/0x20
[88440.551401] [<ffffffff81645909>] system_call_fastpath+0x16/0x1b

Scenario:
=========
1. Create an EC volume (3 x (8+3)) from an 11-node cluster (see the setup sketch after this report).
2. Set event-threads to 4.
3. Enable client-io-threads.
4. Mount the volume on the client via Fuse.
5. Open 10 sessions on the client (different terminals).
6. From 5 sessions, create IO on the same directory using crefi. Make sure to run crefi multi-threaded with -T 10:
   crefi -b 4 -d 4 -n 20 --multi --random --min 1K --max 100M -T 10 -t text --fop=create /mnt/fuse/multiple_threads/
7. From the other 5 sessions, stat all the contents of the directory every 30 seconds:
   for i in {1..100}; do echo "This is iteration $i" ; find * | xargs stat ; sleep 30 ; done
8. While steps 6 and 7 are in progress, bring down 1 brick from each subvolume. Wait for a while, then start the volume forcefully to bring the bricks back online (see the brick kill/force-start sketch below).
9. Repeat step 8 multiple times (3-4), each time after healing has completed.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.9-10.el7rhgs.x86_64

How reproducible:
=================
Hit the issue once; tried again but could not reproduce.

Additional info:
================
While the above use case was in progress, collected resource consumption (CPU and memory) every 10 seconds (see the sampling loop sketched below).

On the client, the glusterfs process peaked at 181% CPU with a minimum of 0%; for most of the run it stayed in the 50-181% range. Its memory usage stayed in the 2.7-6.7% range throughout. The primary server's CPU also spiked to 391% for fractions of a second, and its max memory usage was 0.7%.

Currently raising this bz under EC for initial analysis.
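For reference, a minimal sketch of the setup in steps 1-4, assuming 11 hosts named server1..server11, three bricks per host under /bricks, and a volume named testvol (hostnames, brick paths, and the volume name are illustrative assumptions, not taken from this report):

# 3 x (8+3) disperse volume: 33 bricks, 11 per subvolume, laid out so
# that each subvolume holds exactly one brick from each server.
gluster volume create testvol disperse-data 8 redundancy 3 \
    server{1..11}:/bricks/brick1/testvol \
    server{1..11}:/bricks/brick2/testvol \
    server{1..11}:/bricks/brick3/testvol

# Step 2: event-threads to 4 (the report does not say whether the
# client-side option, server-side option, or both were set).
gluster volume set testvol client.event-threads 4

# Step 3: enable client-io-threads.
gluster volume set testvol performance.client-io-threads on

gluster volume start testvol

# Step 4: Fuse-mount the volume on the client.
mount -t glusterfs server1:/testvol /mnt/fuse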
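Likewise, a hedged sketch of the brick-down/force-start cycle from steps 8-9. PID1..PID3 are placeholder variables to be filled in from the PID column of the volume status output; the volume name is the same assumption as above:

gluster volume status testvol        # note the PID of one brick per subvolume
kill -9 "$PID1" "$PID2" "$PID3"      # one brick down in each disperse subvolume
sleep 300                            # let the IO and stat loops run degraded
gluster volume start testvol force   # bring the killed bricks back online
gluster volume heal testvol info     # repeat only once this shows no pending entries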
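The exact method used to collect the resource numbers above is not recorded in this report; one plausible shape for the 10-second sampling is:

# Sample %CPU/%MEM of all glusterfs processes every 10 seconds,
# timestamping each sample and keeping a copy on disk.
while true; do
    date '+%F %T'
    ps -C glusterfs -o pid,%cpu,%mem,args
    sleep 10
done | tee /tmp/glusterfs-usage.log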
Keeping the bug in needinfo until further updates