Bug 733418

Summary: gfs2 hung task - kernel 2.6.18-194.32.1
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.5
Reporter: Charlie Brady <charlieb-fedora-bugzilla>
Assignee: Steve Whitehouse <swhiteho>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: adas, anprice, bmarzins, rpeterso, rwheeler, swhiteho
Status: CLOSED INSUFFICIENT_DATA
Severity: unspecified
Priority: low
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2011-08-30 13:30:20 UTC

Attachments:
- gfs stack backtraces from node1
- gfs stack backtraces from node2

Description Charlie Brady 2011-08-25 16:34:29 UTC
This was seen on a CentOS system; reporting it here in case it is useful to you.

kernel-PAE-2.6.18-194.32.1.el5
gfs2-utils-0.1.62-20.el5
cman-2.0.115-34.el5_5.4

The system booted Jul 28 07:53 and reportedly operated without problems until Aug 22 12:03:55, when syslog showed:

Aug 22 12:03:55 node2 kernel: INFO: task ls:20713 blocked for more than 120 seconds.
Aug 22 12:03:55 node2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 22 12:03:55 node2 kernel: ls            D 0007B6BF  2552 20713  11209                     (NOTLB)
Aug 22 12:03:55 node2 kernel:        f111fd98 00000086 77574858 0007b6bf 00000000 f6ca5400 f111fd58 00000008 
Aug 22 12:03:55 node2 kernel:        e7cc2aa0 775889f2 0007b6bf 0001419a 0000000a e7cc2bac c44576a8 eccab200 
Aug 22 12:03:55 node2 kernel:        f8ef721c c4458048 e7013cf0 00000000 d42f4090 00000018 e7013cd0 ffffffff 
Aug 22 12:03:55 node2 kernel: Call Trace:
Aug 22 12:03:55 node2 kernel: klogd 1.4.1, ---------- state change ---------- 
Aug 22 12:03:55 node2 kernel: Inspecting /boot/System.map-2.6.18-194.32.1.el5PAE
Aug 22 12:03:55 node2 kernel: Loaded 30070 symbols from /boot/System.map-2.6.18-194.32.1.el5PAE.
Aug 22 12:03:55 node2 kernel: Symbols match kernel version 2.6.18.
Aug 22 12:03:55 node2 kernel: No module symbols loaded - kernel modules not enabled. 
Aug 22 12:03:55 node2 kernel:  [<f8ef721c>] gdlm_bast+0x0/0x78 [lock_dlm]
Aug 22 12:03:55 node2 kernel:  [<f8fd510a>] just_schedule+0x5/0x8 [gfs2]
Aug 22 12:03:55 node2 kernel:  [__wait_on_bit+51/88] __wait_on_bit+0x33/0x58
Aug 22 12:03:55 node2 kernel:  [<c061cacd>] __wait_on_bit+0x33/0x58
Aug 22 12:03:55 node2 kernel:  [<f8fd5105>] just_schedule+0x0/0x8 [gfs2]
Aug 22 12:03:55 node2 kernel:  [<f8fd5105>] just_schedule+0x0/0x8 [gfs2]
Aug 22 12:03:55 node2 kernel:  [out_of_line_wait_on_bit+98/106] out_of_line_wait_on_bit+0x62/0x6a
Aug 22 12:03:55 node2 kernel:  [<c061cb54>] out_of_line_wait_on_bit+0x62/0x6a
Aug 22 12:03:55 node2 kernel:  [wake_bit_function+0/60] wake_bit_function+0x0/0x3c
Aug 22 12:03:55 node2 kernel:  [<c0436524>] wake_bit_function+0x0/0x3c
Aug 22 12:03:55 node2 kernel:  [<f8fd50fe>] gfs2_glock_wait+0x27/0x2e [gfs2]
Aug 22 12:03:55 node2 kernel:  [<f8fe3182>] gfs2_lookup+0x4b/0x90 [gfs2]
Aug 22 12:03:55 node2 kernel:  [<f8fe317b>] gfs2_lookup+0x44/0x90 [gfs2]
Aug 22 12:03:55 node2 kernel:  [do_lookup+194/372] do_lookup+0xc2/0x174
Aug 22 12:03:55 node2 kernel:  [<c0480f6f>] do_lookup+0xc2/0x174
Aug 22 12:03:55 node2 kernel:  [__link_path_walk+2170/3403] __link_path_walk+0x87a/0xd4b
Aug 22 12:03:55 node2 kernel:  [<c0482d54>] __link_path_walk+0x87a/0xd4b
Aug 22 12:03:55 node2 kernel:  [link_path_walk+56/149] link_path_walk+0x38/0x95
Aug 22 12:03:55 node2 kernel:  [<c048325d>] link_path_walk+0x38/0x95
Aug 22 12:03:55 node2 kernel:  [do_path_lookup+537/639] do_path_lookup+0x219/0x27f
Aug 22 12:03:55 node2 kernel:  [<c048361e>] do_path_lookup+0x219/0x27f
Aug 22 12:03:55 node2 kernel:  [__user_walk_fd+41/58] __user_walk_fd+0x29/0x3a
Aug 22 12:03:55 node2 kernel:  [<c0483d78>] __user_walk_fd+0x29/0x3a
Aug 22 12:03:55 node2 kernel:  [vfs_lstat_fd+18/57] vfs_lstat_fd+0x12/0x39
Aug 22 12:03:55 node2 kernel:  [<c047d3ef>] vfs_lstat_fd+0x12/0x39
Aug 22 12:03:55 node2 kernel:  [_atomic_dec_and_lock+42/68] _atomic_dec_and_lock+0x2a/0x44
Aug 22 12:03:55 node2 kernel:  [<c04ebd1e>] _atomic_dec_and_lock+0x2a/0x44
Aug 22 12:03:55 node2 kernel:  [mntput_no_expire+17/106] mntput_no_expire+0x11/0x6a
Aug 22 12:03:55 node2 kernel:  [<c048dd56>] mntput_no_expire+0x11/0x6a
Aug 22 12:03:55 node2 kernel:  [sys_lstat64+15/35] sys_lstat64+0xf/0x23
Aug 22 12:03:56 node2 kernel:  [<c047d45b>] sys_lstat64+0xf/0x23
Aug 22 12:03:56 node2 kernel:  [sysenter_past_esp+86/121] sysenter_past_esp+0x56/0x79
Aug 22 12:03:56 node2 kernel:  [<c0404ead>] sysenter_past_esp+0x56/0x79
Aug 22 12:03:56 node2 kernel:  =======================

Processes accessing GFS2 were blocked.

Other hung task reports continued for a few minutes:

Aug 22 12:03:55 node2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 22 12:05:55 node2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 22 12:05:56 node2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 22 12:07:55 node2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 22 12:07:56 node2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 22 12:07:57 node2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 22 12:07:58 node2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 22 12:07:59 node2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 22 12:09:55 node2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 22 12:09:56 node2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
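For reference, the watchdog producing these messages is the kernel hung-task detector, and the log itself names the knob that silences it. A minimal sketch of inspecting and disabling it (the procfs path is the standard sysctl location; disabling only suppresses the warning, it does not unblock anything):

```shell
# Standard sysctl path for the hung-task detector timeout (in seconds).
TIMEOUT_FILE=/proc/sys/kernel/hung_task_timeout_secs

# Show the current timeout, if the detector is compiled in.
if [ -r "$TIMEOUT_FILE" ]; then
    cat "$TIMEOUT_FILE"
fi

# Writing 0 disables the warning only; blocked tasks stay blocked:
#   echo 0 > "$TIMEOUT_FILE"
echo "hung-task watchdog sysctl: $TIMEOUT_FILE"
```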

The system was power cycled at 12:57. After reboot and rejoining the cluster (two nodes), both nodes started to show hung tasks. The cluster was taken down, the GFS2 file system was fscked, and the cluster has been up and stable since.

I will attach more of the hung task stack traces. Let me know what further information I can provide. Unfortunately I was not hands-on at the time of the incident, so I don't know what additional observation and intervention was done. I don't know what the lock status was.

Comment 1 Charlie Brady 2011-08-25 16:50:25 UTC
Created attachment 519927 [details]
gfs stack backtraces from node1

Comment 2 Charlie Brady 2011-08-25 16:51:08 UTC
Created attachment 519928 [details]
gfs stack backtraces from node2

Comment 3 Charlie Brady 2011-08-25 18:00:22 UTC
This issue may have already been addressed in 5.6 update kernels:

* Thu Jun 16 2011 Phillip Lougher <plougher> [2.6.18-238.15.1.el5]
...
- [fs] gfs2: fix resource group bitmap corruption (Robert S Peterson) [711519 690555]
...

* Wed Jun 01 2011 Phillip Lougher <plougher> [2.6.18-238.13.1.el5]
- [fs] gfs2: fix processes waiting on already-available inode glock (Phillip Lougher) [709767 694669]

* Sat May 07 2011 Phillip Lougher <plougher> [2.6.18-238.12.1.el5]
...
- [fs] gfs2: fix filesystem hang caused by incorrect lock order (Robert S Peterson) [688855 656032]
- [fs] gfs2: restructure reclaim of unlinked dinodes (Phillip Lougher) [688855 656032]
- [fs] gfs2: unlock on gfs2_trans_begin error (Robert S Peterson) [688855 656032]
...
* Fri Apr 15 2011 Phillip Lougher <plougher> [2.6.18-238.10.1.el5]
- [fs] gfs2: creating large files suddenly slow to a crawl (Robert S Peterson) [690239 683155]
...
* Fri Mar 04 2011 Jiri Pirko <jpirko> [2.6.18-238.7.1.el5]
...
- [fs] gfs2: remove iopen glocks from cache on delete fail (Benjamin Marzinski) [675909 666080]
...
* Tue Jan 04 2011 Jiri Pirko <jpirko> [2.6.18-238.1.1.el5]
...
- [fs] gfs2: fix statfs error after gfs2_grow (Robert S Peterson) [666792 660661]
...

Comment 4 Steve Whitehouse 2011-08-26 11:14:24 UTC
The stack trace looks like a task waiting on an inode glock during a stat system call. The key to tracking down the issue is to look at the glock dumps on each node and see which node/process is holding the glock in question, and thus preventing the ls process from getting the glock required for that stat.

Note that it can be useful to use ls -f in order to disable stat()ing of files by ls, which dramatically improves ls performance for large directories.
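A sketch of gathering the glock dump referred to above, assuming the debugfs export used by GFS2 kernels of this era; the filesystem name mycluster:myfs and the mount point are placeholders, not values from this report:

```shell
# GFS2 exports glock state through debugfs as
#   /sys/kernel/debug/gfs2/<clustername:fsname>/glocks
FSNAME="mycluster:myfs"   # placeholder; substitute your own <clustername>:<fsname>
GLOCK_DUMP="/sys/kernel/debug/gfs2/${FSNAME}/glocks"

# Mount debugfs if it is not already mounted (requires root).
mount -t debugfs none /sys/kernel/debug 2>/dev/null || true

if [ -r "$GLOCK_DUMP" ]; then
    # Capture the dump on every node, so the glock holder on one node
    # can be matched against the waiter (the blocked ls) on the other.
    cat "$GLOCK_DUMP" > "/tmp/glocks.$(hostname).txt"
else
    echo "no glock dump at $GLOCK_DUMP (debugfs or gfs2 not available here)"
fi

# ls -f disables sorting and the per-entry stat(), so it avoids taking
# the inode glocks that the blocked ls above was waiting for:
#   ls -f /mnt/gfs2   # placeholder mount point
```

Comparing the per-node dumps shows which node holds the contended glock and which PID is queued behind it.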

It is quite likely that one of the fixes in comment #3 will resolve this, with this one being the most likely:

* Wed Jun 01 2011 Phillip Lougher <plougher> [2.6.18-238.13.1.el5]
- [fs] gfs2: fix processes waiting on already-available inode glock (Phillip
Lougher) [709767 694669]

Comment 5 Steve Whitehouse 2011-08-30 12:56:29 UTC
If there is no possibility of getting the glock dump, then I think we are going to be pretty much stuck so far as providing a root cause goes.

The report is useful anyway, as it will be kept on file in case of any similar future events; however, if no further information is available, we will have to close this bug as insufficient data.

Comment 6 Charlie Brady 2011-08-30 13:26:50 UTC
> If there is no possibility of getting the glock dump, then I think we
> are going to be pretty much stuck so far as providing a root cause goes.

That's what I thought. I will apply the cluebat when the relevant person returns from vacation. I'm pretty sure that there will be no record of that information, and I don't think he looked at the glock information.

> The report is useful anyway as it will be kept on file in case of any
> similar future events, ...

That's exactly why I made the report.

> ... however if there is not further information available we will
> have to close this bug as insufficient data.

Understood.

Comment 7 Steve Whitehouse 2011-08-30 13:30:20 UTC
Ok, thanks for the info!