204364 – GFS2 log flushing code looping

Summary: GFS2 log flushing code looping

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.0
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Steve Whitehouse
QA Contact:	Dean Jansa
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	204760

Reported:	2006-08-28 18:20 UTC by Russell Cattelan
Modified:	2007-11-30 22:07 UTC
CC List:	1 user
Fixed In Version:	beta2
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2006-12-23 00:06:06 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
Another example (4.35 KB, text/plain) 2006-09-12 09:50 UTC, Steve Whitehouse	no flags
Bug fix to stop lockups in log flush code (861 bytes, patch) 2006-11-24 11:28 UTC, Steve Whitehouse	no flags

Description Russell Cattelan 2006-08-28 18:20:33 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.0.6) Gecko/20060808 Fedora/1.5.0.6-2.fc5 Firefox/1.5.0.6 pango-text

Description of problem:
bear-03.lab.msp.redhat.com login: [ 2410.524000] GFS2 (built Aug 28 2006 11:43:40) installed
[ 2474.744000] BUG: soft lockup detected on CPU#1!
[ 2474.748000]  [<c0203d97>] show_trace+0xd/0x10
[ 2474.752000]  [<c02042f5>] dump_stack+0x19/0x1b
[ 2474.760000]  [<c02430df>] softlockup_tick+0xa5/0xb9
[ 2474.764000]  [<c0228558>] run_local_timers+0x12/0x14
[ 2474.768000]  [<c02288bf>] update_process_times+0x3c/0x61
[ 2474.772000]  [<c0214132>] smp_apic_timer_interrupt+0x5f/0x69
[ 2474.780000]  [<c0203807>] apic_timer_interrupt+0x1f/0x24
[ 2474.784000]  [<f8ca12b6>] gfs2_log_flush+0x109/0x2fa [gfs2]
[ 2474.792000]  [<f8c9e50c>] inode_go_sync+0x38/0x98 [gfs2]
[ 2474.796000]  [<f8c9e0d9>] gfs2_glock_drop_th+0xc9/0x14d [gfs2]
[ 2474.800000]  [<f8c9e41a>] inode_go_drop_th+0x12/0x15 [gfs2]
[ 2474.808000]  [<f8c9ca32>] run_queue+0x10d/0x315 [gfs2]
[ 2474.812000]  [<f8c9ce26>] gfs2_glock_dq+0xa3/0xb1 [gfs2]
[ 2474.816000]  [<f8c9ce61>] gfs2_glock_dq_uninit+0xb/0x15 [gfs2]
[ 2474.824000]  [<f8cb0446>] gfs2_statfs_sync+0x201/0x20b [gfs2]
[ 2474.828000]  [<f8c9644a>] gfs2_quotad+0x4c/0x132 [gfs2]
[ 2474.836000]  [<c0230e3b>] kthread+0xc3/0xf0
[ 2474.840000]  [<c0201005>] kernel_thread_helper+0x5/0xb
[ 2484.844000] BUG: soft lockup detected on CPU#1!


It appears that gfs2_quotad, gfs2_logd and pdflush are all competing
to flush the log.
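To make the suspected interaction concrete, here is a minimal userspace sketch, assuming a simplified journal with a single append sequence number. `struct journal`, `journal_flush_upto()`, `run_flushers()` and the other names are hypothetical; this is not the GFS2 log code and not the patch attached to this bug. It shows one common way to stop several competing flushers from looping: each caller flushes only up to the sequence number it needs, and returns at once if another thread has already got there.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical userspace model of a shared journal; NOT the GFS2 code. */
struct journal {
    pthread_mutex_t lock;
    unsigned long head;       /* sequence number of the last entry appended */
    unsigned long flushed;    /* sequence number of the last entry written out */
    unsigned long nr_flushes; /* how many write-out passes actually ran */
};

static void journal_init(struct journal *j)
{
    pthread_mutex_init(&j->lock, NULL);
    j->head = j->flushed = j->nr_flushes = 0;
}

static unsigned long journal_append(struct journal *j)
{
    pthread_mutex_lock(&j->lock);
    unsigned long seq = ++j->head;
    pthread_mutex_unlock(&j->lock);
    return seq;
}

/* Ensure everything up to 'target' is written out. The caller's goal is
 * fixed when it calls in, so entries appended later by other threads can
 * never keep it busy; if another flusher already covered 'target', it
 * returns without doing any work. */
static void journal_flush_upto(struct journal *j, unsigned long target)
{
    pthread_mutex_lock(&j->lock);
    if (j->flushed < target) {
        j->flushed = j->head; /* one pass writes out everything queued so far */
        j->nr_flushes++;
    }
    pthread_mutex_unlock(&j->lock);
}

struct worker_arg {
    struct journal *j;
    int iters;
};

/* Stand-in for one flushing thread (cf. gfs2_quotad, gfs2_logd, pdflush). */
static void *worker(void *p)
{
    struct worker_arg *a = p;
    for (int i = 0; i < a->iters; i++) {
        unsigned long seq = journal_append(a->j);
        journal_flush_upto(a->j, seq);
    }
    return NULL;
}

/* Run 'nthreads' competing flushers; returns the final head sequence number. */
static unsigned long run_flushers(struct journal *j, int nthreads, int iters)
{
    pthread_t tid[16];
    struct worker_arg arg = { j, iters };

    assert(nthreads > 0 && nthreads <= 16);
    for (int i = 0; i < nthreads; i++)
        if (pthread_create(&tid[i], NULL, worker, &arg) != 0)
            abort();
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL); /* join makes j's final state visible */
    return j->head;
}
```

Calling `run_flushers(&j, 3, 1000)` on an initialised journal models three daemons each appending and flushing 1000 times. Every call terminates, `flushed` ends equal to `head`, and `nr_flushes` is at most the number of requests because callers whose entries were already covered skip the write-out.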

Version-Release number of selected component (if applicable):


How reproducible:
Always


Steps to Reproduce:
1. write a file to gfs2

Actual Results:


Expected Results:


Additional info:

Comment 1 Steve Whitehouse 2006-08-31 11:29:36 UTC

Please try again with the latest fix now in my git tree,
623d93555c8884768db65ffc11509c93e50dd4db ("[GFS2] Fix releasepage bug (fixes
direct i/o writes)"), as I suspect that it will have fixed this bug.

Comment 2 Steve Whitehouse 2006-09-12 09:50:53 UTC

Created attachment 136063 [details]
Another example

Here is another example of such a lockup, this time reproduced with postmark:

set transactions 100000
set number 100000

The backtrace is different, but it's probably the same issue. I suspect an
infinite loop, since the lock debugging code should have caught any spinlock
recursion or deadlock.
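The reasoning above can be illustrated with a hypothetical C sketch; it is not GFS2 code, and `struct log_state`, `log_empty` and `wait_for_empty()` are invented names. A retry loop that takes and drops a lock on every pass never recurses on the lock or holds two locks in conflicting order, so lock debugging has nothing to report. If no other thread ever makes the condition true, the loop simply spins, which surfaces as a soft lockup rather than as a lock-debugging warning. The `max_passes` bound exists only so that the sketch terminates.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical log state; NOT the GFS2 data structures. */
struct log_state {
    pthread_mutex_t lock;
    bool log_empty; /* hypothetical "nothing left to write back" flag */
};

/* Each pass takes and releases the lock cleanly: no recursion and no
 * lock-order inversion, so a lock validator sees nothing wrong. Without
 * the max_passes cap, a flag that is never set would make this loop
 * forever. Returns the pass on which the flag was seen, or -1 if the
 * caller gave up. */
static int wait_for_empty(struct log_state *s, int max_passes)
{
    assert(max_passes > 0);
    for (int pass = 1; pass <= max_passes; pass++) {
        pthread_mutex_lock(&s->lock);
        bool done = s->log_empty;
        pthread_mutex_unlock(&s->lock);
        if (done)
            return pass;
    }
    return -1;
}
```

With the flag never set, `wait_for_empty()` burns through every pass and returns -1; in an uncapped kernel loop the same shape would keep the CPU busy until the soft lockup detector fires.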

Comment 3 Steve Whitehouse 2006-09-19 10:18:59 UTC

I've had another bash at fixing this. The git commit:
74669416f747363c14dba2ee6137540ae5a6834f has a patch which survives my postmark
test so far.

Comment 4 Steve Whitehouse 2006-09-22 09:54:06 UTC

This bug is not yet fixed. It seems to be timing-related, as small changes in
unrelated code mean that sometimes I see it a lot and sometimes hardly ever,
but it's certainly still happening.

Comment 7 Steve Whitehouse 2006-11-24 11:28:31 UTC

Created attachment 142057 [details]
Bug fix to stop lockups in log flush code

This is the patch which has gone upstream

Comment 9 Don Zickus 2006-12-14 01:06:02 UTC

in 2.6.18-1.2876.el5

Comment 10 RHEL Program Management 2006-12-23 00:06:06 UTC

A package has been built which should help the problem described in 
this bug report. This report is therefore being closed with a resolution 
of CURRENTRELEASE. You may reopen this bug report if the solution does 
not work for you.
