Bug 204364

Summary: GFS2 log flushing code looping
Product: Red Hat Enterprise Linux 5 Reporter: Russell Cattelan <cattelan>
Component: kernel Assignee: Steve Whitehouse <swhiteho>
Status: CLOSED CURRENTRELEASE QA Contact: Dean Jansa <djansa>
Severity: medium Docs Contact:
Priority: medium    
Version: 5.0 CC: swhiteho
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: beta2 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2006-12-23 00:06:06 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 204760    
Attachments:
Description                                   Flags
Another example                               none
Bug fix to stop lockups in log flush code     none

Description Russell Cattelan 2006-08-28 18:20:33 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.0.6) Gecko/20060808 Fedora/1.5.0.6-2.fc5 Firefox/1.5.0.6 pango-text

Description of problem:
bear-03.lab.msp.redhat.com login:
[ 2410.524000] GFS2 (built Aug 28 2006 11:43:40) installed
[ 2474.744000] BUG: soft lockup detected on CPU#1!
[ 2474.748000]  [<c0203d97>] show_trace+0xd/0x10
[ 2474.752000]  [<c02042f5>] dump_stack+0x19/0x1b
[ 2474.760000]  [<c02430df>] softlockup_tick+0xa5/0xb9
[ 2474.764000]  [<c0228558>] run_local_timers+0x12/0x14
[ 2474.768000]  [<c02288bf>] update_process_times+0x3c/0x61
[ 2474.772000]  [<c0214132>] smp_apic_timer_interrupt+0x5f/0x69
[ 2474.780000]  [<c0203807>] apic_timer_interrupt+0x1f/0x24
[ 2474.784000]  [<f8ca12b6>] gfs2_log_flush+0x109/0x2fa [gfs2]
[ 2474.792000]  [<f8c9e50c>] inode_go_sync+0x38/0x98 [gfs2]
[ 2474.796000]  [<f8c9e0d9>] gfs2_glock_drop_th+0xc9/0x14d [gfs2]
[ 2474.800000]  [<f8c9e41a>] inode_go_drop_th+0x12/0x15 [gfs2]
[ 2474.808000]  [<f8c9ca32>] run_queue+0x10d/0x315 [gfs2]
[ 2474.812000]  [<f8c9ce26>] gfs2_glock_dq+0xa3/0xb1 [gfs2]
[ 2474.816000]  [<f8c9ce61>] gfs2_glock_dq_uninit+0xb/0x15 [gfs2]
[ 2474.824000]  [<f8cb0446>] gfs2_statfs_sync+0x201/0x20b [gfs2]
[ 2474.828000]  [<f8c9644a>] gfs2_quotad+0x4c/0x132 [gfs2]
[ 2474.836000]  [<c0230e3b>] kthread+0xc3/0xf0
[ 2474.840000]  [<c0201005>] kernel_thread_helper+0x5/0xb
[ 2484.844000] BUG: soft lockup detected on CPU#1!


It appears that gfs2_quotad, gfs2_logd and pdflush are all competing
to flush the log.
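To make the suspected race concrete, here is a small userspace C sketch. It is purely illustrative: it is not GFS2 code and not the patch that eventually went upstream. The names sketch_log, sketch_log_flush and run_competing_flushers are hypothetical, and ordinary pthreads stand in for gfs2_quotad, gfs2_logd and pdflush. It shows one general way to make concurrent flush requests safe: serialize them with a lock, and have any caller that finds the log already clean return at once, so the work runs once however many threads ask for it.

```c
/* Hypothetical userspace sketch, NOT GFS2 code: pthreads stand in for
 * gfs2_quotad, gfs2_logd and pdflush all asking for the same log flush. */
#include <assert.h>
#include <pthread.h>

struct sketch_log {
    pthread_mutex_t lock;    /* serializes flushers */
    unsigned dirty_blocks;   /* blocks waiting to be written out */
    unsigned flushes_done;   /* times the flush work actually ran */
};

/* Flush the log if it has anything pending; callable from any thread.
 * A caller that finds the log already clean returns at once instead of
 * redoing (or retrying) the work. */
static void sketch_log_flush(struct sketch_log *log)
{
    pthread_mutex_lock(&log->lock);
    if (log->dirty_blocks > 0) {
        log->dirty_blocks = 0;   /* pretend everything was written out */
        log->flushes_done++;
    }
    assert(log->dirty_blocks == 0);  /* invariant: clean once we unlock */
    pthread_mutex_unlock(&log->lock);
}

static void *flusher_thread(void *arg)
{
    sketch_log_flush(arg);
    return NULL;
}

/* Start up to 16 competing flushers against a log with `dirty` pending
 * blocks and report how many times the flush work actually ran. */
static unsigned run_competing_flushers(int nthreads, unsigned dirty)
{
    struct sketch_log log = { .dirty_blocks = dirty, .flushes_done = 0 };
    pthread_t tids[16];
    int started = 0;

    pthread_mutex_init(&log.lock, NULL);
    if (nthreads > 16)
        nthreads = 16;
    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&tids[started], NULL, flusher_thread, &log) == 0)
            started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    pthread_mutex_destroy(&log.lock);
    return log.flushes_done;
}
```

For example, run_competing_flushers(3, 8) should return 1: whichever thread takes the lock first does the flush, and the other two find nothing left to do.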

Version-Release number of selected component (if applicable):


How reproducible:
Always


Steps to Reproduce:
1. write a file to gfs2

Actual Results:


Expected Results:


Additional info:

Comment 1 Steve Whitehouse 2006-08-31 11:29:36 UTC
Please try again with the latest fix now in my git tree,
623d93555c8884768db65ffc11509c93e50dd4db ("[GFS2] Fix releasepage bug (fixes
direct i/o writes)"), as I suspect that it will have fixed this bug.


Comment 2 Steve Whitehouse 2006-09-12 09:50:53 UTC
Created attachment 136063 [details]
Another example

Here is another example of such a lockup, this time reproduced with postmark using:

set transactions 100000
set number 100000

The back trace is different, but it's probably the same issue. I suspect an
infinite loop, since the lock debugging code should have caught any spinlock
recursion or deadlock.

Comment 3 Steve Whitehouse 2006-09-19 10:18:59 UTC
I've had another bash at fixing this. Git commit
74669416f747363c14dba2ee6137540ae5a6834f has a patch which has survived my
postmark test so far.

Comment 4 Steve Whitehouse 2006-09-22 09:54:06 UTC
This bug is not yet fixed. It seems to be timing related: small changes in
unrelated code mean that sometimes I see it a lot and sometimes hardly ever,
but it's certainly still happening.

Comment 7 Steve Whitehouse 2006-11-24 11:28:31 UTC
Created attachment 142057 [details]
Bug fix to stop lockups in log flush code

This is the patch which has gone upstream.

Comment 9 Don Zickus 2006-12-14 01:06:02 UTC
Included in kernel 2.6.18-1.2876.el5.

Comment 10 RHEL Program Management 2006-12-23 00:06:06 UTC
A package has been built which should help the problem described in 
this bug report. This report is therefore being closed with a resolution 
of CURRENTRELEASE. You may reopen this bug report if the solution does 
not work for you.