174123 – BUG: write-lock lockup on CPU#4, cc1/13482, d5f92abc (Not tainted)

Summary: BUG: write-lock lockup on CPU#4, cc1/13482, d5f92abc (Not tainted)

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	4
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Stephen Tweedie
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2005-11-24 19:38 UTC by Adrian Reber
Modified:	2007-11-30 22:11 UTC
CC List:	2 users
Fixed In Version:	2.6.14-1.1644_FC4smp
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2006-05-05 21:17:43 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
dmesg dump (32.85 KB, text/plain) 2005-11-24 19:38 UTC, Adrian Reber
dmesg (85.63 KB, text/plain) 2005-12-13 06:59 UTC, Adrian Reber
output of dmesg (123.56 KB, text/plain) 2006-01-09 16:18 UTC, Adrian Reber

Description Adrian Reber 2005-11-24 19:38:02 UTC

BUG: write-lock lockup on CPU#4, cc1/13482, d5f92abc (Not tainted)
BUG: write-lock lockup on CPU#3, cc1/13502, d5f92abc (Not tainted)
 [<c01df047>] __write_lock_debug+0xb4/0xdd
 [<c01df0af>] _raw_write_lock+0x3f/0x7c
 [<c0144aed>] add_to_page_cache+0x34/0xaf
 [<c0185dc6>] mpage_readpages+0xed/0x13d
 [<c011a476>] activate_task+0x8f/0x9e
 [<c011aa80>] try_to_wake_up+0x6e/0x343
 [<f9cd21ec>] linvfs_readpages+0x0/0x15 [xfs]
 [<c014b57a>] read_pages+0x2a/0xf7
 [<f9cd1fd2>] linvfs_get_block+0x0/0x35 [xfs]
 [<c0149245>] __alloc_pages+0x109/0x469
 [<c014b7b0>] __do_page_cache_readahead+0x169/0x16e
 [<c0145e26>] filemap_nopage+0x31d/0x39b
 [<c0154bf7>] do_no_page+0x96/0x30b
 [<c0155089>] __handle_mm_fault+0x13e/0x1e5
 [<c031ede4>] do_page_fault+0x274/0x700
 [<c011625c>] smp_apic_timer_interrupt+0xc1/0xca
 [<c031eb70>] do_page_fault+0x0/0x700
 [<c010457f>] error_code+0x4f/0x54
 [<c01df047>] __write_lock_debug+0xb4/0xdd
 [<c01df0af>] _raw_write_lock+0x3f/0x7c
 [<c0144aed>] add_to_page_cache+0x34/0xaf
 [<c0185dc6>] mpage_readpages+0xed/0x13d
 [<f9cd1fd2>] linvfs_get_block+0x0/0x35 [xfs]
 [<c0148bd0>] rmqueue_bulk+0x77/0x81
 [<f9cd21ec>] linvfs_readpages+0x0/0x15 [xfs]
 [<c014b57a>] read_pages+0x2a/0xf7
 [<f9cd1fd2>] linvfs_get_block+0x0/0x35 [xfs]
 [<c0149245>] __alloc_pages+0x109/0x469
 [<c014b7b0>] __do_page_cache_readahead+0x169/0x16e
 [<c0145e26>] filemap_nopage+0x31d/0x39b
 [<c0154bf7>] do_no_page+0x96/0x30b
 [<c0155089>] __handle_mm_fault+0x13e/0x1e5
 [<c031ede4>] do_page_fault+0x274/0x700
 [<c011625c>] smp_apic_timer_interrupt+0xc1/0xca
 [<c031eb70>] do_page_fault+0x0/0x700
 [<c010457f>] error_code+0x4f/0x54

$ uname -a
Linux rhlx01.fht-esslingen.de 2.6.14-1.1637_FC4smp #1 SMP Wed Nov 9 18:34:11 EST 2005 i686 i686 i386 GNU/Linux

The server froze completely for about 30 minutes; even the clock stopped
advancing. It now seems to be back to normal.

Comment 1 Adrian Reber 2005-11-24 19:38:02 UTC

Created attachment 121461
dmesg dump

Comment 2 Adrian Reber 2005-12-01 07:53:57 UTC

It happened again:

Nov 30 18:05:10 rhlx01 kernel: BUG: write-lock lockup on CPU#4, msgfmt/9352, e4115510 (Not tainted)
Nov 30 18:05:10 rhlx01 kernel: BUG: write-lock lockup on CPU#2, msgfmt/9353, e4115510 (Not tainted)
Nov 30 18:05:10 rhlx01 kernel:  [<c01df047>] __write_lock_debug+0xb4/0xdd
Nov 30 18:05:10 rhlx01 kernel:  [<c01df0af>] _raw_write_lock+0x3f/0x7c
Nov 30 18:05:10 rhlx01 kernel:  [<c0144aed>] add_to_page_cache+0x34/0xaf
Nov 30 18:05:10 rhlx01 kernel:  [<c0185dc6>] mpage_readpages+0xed/0x13d
Nov 30 18:05:10 rhlx01 kernel:  [<c01bf58c>] avc_has_perm_noaudit+0x26/0xd1
Nov 30 18:05:10 rhlx01 kernel:  [<f9cd21ec>] linvfs_readpages+0x0/0x15 [xfs]
Nov 30 18:05:10 rhlx01 kernel:  [<c014b57a>] read_pages+0x2a/0xf7
Nov 30 18:05:10 rhlx01 kernel:  [<f9cd1fd2>] linvfs_get_block+0x0/0x35 [xfs]
Nov 30 18:05:10 rhlx01 kernel:  [<c0149245>] __alloc_pages+0x109/0x469
Nov 30 18:05:10 rhlx01 kernel:  [<f9cda405>] vn_revalidate+0x4c/0x58 [xfs]
Nov 30 18:05:10 rhlx01 kernel:  [<c014b7b0>] __do_page_cache_readahead+0x169/0x16e
Nov 30 18:05:11 rhlx01 kernel:  [<c0145e26>] filemap_nopage+0x31d/0x39b
Nov 30 18:05:11 rhlx01 kernel:  [<c0154bf7>] do_no_page+0x96/0x30b
Nov 30 18:05:11 rhlx01 kernel:  [<c0155089>] __handle_mm_fault+0x13e/0x1e5
Nov 30 18:05:12 rhlx01 kernel:  [<c031ede4>] do_page_fault+0x274/0x700
Nov 30 18:05:12 rhlx01 kernel:  [<c031eb70>] do_page_fault+0x0/0x700
Nov 30 18:05:12 rhlx01 kernel:  [<c010457f>] error_code+0x4f/0x54
Nov 30 18:05:13 rhlx01 kernel:  [<c01df047>] __write_lock_debug+0xb4/0xdd
Nov 30 18:05:13 rhlx01 kernel:  [<c01df0af>] _raw_write_lock+0x3f/0x7c
Nov 30 18:05:13 rhlx01 kernel:  [<c0144aed>] add_to_page_cache+0x34/0xaf
Nov 30 18:05:13 rhlx01 kernel:  [<c0185dc6>] mpage_readpages+0xed/0x13d
Nov 30 18:05:13 rhlx01 kernel:  [<c0148bd0>] rmqueue_bulk+0x77/0x81
Nov 30 18:05:13 rhlx01 kernel:  [<f9cd21ec>] linvfs_readpages+0x0/0x15 [xfs]
Nov 30 18:05:13 rhlx01 kernel:  [<c014b57a>] read_pages+0x2a/0xf7
Nov 30 18:05:13 rhlx01 kernel:  [<f9cd1fd2>] linvfs_get_block+0x0/0x35 [xfs]
Nov 30 18:05:13 rhlx01 kernel:  [<c0149245>] __alloc_pages+0x109/0x469
Nov 30 18:05:13 rhlx01 kernel:  [<f9cda405>] vn_revalidate+0x4c/0x58 [xfs]
Nov 30 18:05:13 rhlx01 kernel:  [<c014b7b0>] __do_page_cache_readahead+0x169/0x16e
Nov 30 18:05:14 rhlx01 kernel:  [<c0145e26>] filemap_nopage+0x31d/0x39b
Nov 30 18:05:14 rhlx01 kernel:  [<c0154bf7>] do_no_page+0x96/0x30b
Nov 30 18:05:14 rhlx01 kernel:  [<c0155089>] __handle_mm_fault+0x13e/0x1e5
Nov 30 18:05:14 rhlx01 kernel:  [<c031ede4>] do_page_fault+0x274/0x700
Nov 30 18:05:14 rhlx01 kernel:  [<c031eb70>] do_page_fault+0x0/0x700
Nov 30 18:05:14 rhlx01 kernel:  [<c010457f>] error_code+0x4f/0x54

The system crashed 3 hours later without any message. This is still
2.6.14-1.1637_FC4smp. We will now reboot with 2.6.14-1.1644_FC4smp and see if
this BUG still appears.

Comment 3 Dave Jones 2005-12-01 09:01:22 UTC

Please report XFS problems to its upstream maintainer. <nathans>
Since XFS is an unsupported filesystem in Fedora, it'll get fixed faster that way.

Comment 4 Adrian Reber 2005-12-13 06:56:54 UTC

I am reopening this bug because it happened again without XFS. I will attach a
copy of the dmesg output.

Comment 5 Adrian Reber 2005-12-13 06:59:41 UTC

Created attachment 122168
dmesg

I just noticed that we are not running the newest version (2.6.14-1.1644_FC4smp)
but are still on 2.6.14-1.1637_FC4smp.

Comment 6 Stephen Tweedie 2005-12-13 13:48:24 UTC

There's not much to go on here.  Things locked up because 5 tasks are stuck in
add_to_page_cache() at

		write_lock_irq(&mapping->tree_lock);

but there's no sign of who owns that lock or why it has not been released.  Some
things that might help would be enabling NMI lockup detection, or obtaining a
full alt-sysrq-t after the lockup.

It still has XFS mounted, by the looks of things; it's certainly possible that
it's still an XFS bug and we're just hitting a lock that XFS has not released.  

What sort of workload are you running?  Have there been any other suspicious log
messages that might help?
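
To see at a glance which tasks are waiting on the same lock, the "BUG: write-lock lockup" lines from a dmesg or syslog capture can be grouped by the trailing hex value. The following is only a rough sketch: it assumes that value is the address of the contended lock, and the function name `group_waiters_by_lock` is made up for illustration. It shows who is waiting, not who holds the lock.

```python
import re
from collections import defaultdict

# Matches e.g. "BUG: write-lock lockup on CPU#4, cc1/13482, d5f92abc".
# \s+ also spans a line break, because long log lines are sometimes wrapped.
LOCKUP_RE = re.compile(
    r"BUG: write-lock lockup on CPU#(?P<cpu>\d+),\s+"
    r"(?P<comm>[^/\s]+)/(?P<pid>\d+),\s+(?P<lock>[0-9a-f]+)"
)


def group_waiters_by_lock(log_text):
    """Return {lock_value: [(cpu, comm, pid), ...]} for every lockup report.

    Assumption: the hex value after the pid identifies the contended lock.
    """
    waiters = defaultdict(list)
    for m in LOCKUP_RE.finditer(log_text):
        waiters[m["lock"]].append((int(m["cpu"]), m["comm"], int(m["pid"])))
    return dict(waiters)


if __name__ == "__main__":
    sample = (
        "BUG: write-lock lockup on CPU#4, cc1/13482, d5f92abc (Not tainted)\n"
        "BUG: write-lock lockup on CPU#3, cc1/13502, d5f92abc (Not tainted)\n"
    )
    for lock, tasks in group_waiters_by_lock(sample).items():
        print(lock, tasks)
```

If every report in a capture maps to a single value, all stuck tasks are contending for the same lock, which is consistent with the traces all ending in add_to_page_cache().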

Comment 7 Adrian Reber 2005-12-13 14:02:22 UTC

No XFS filesystem is mounted anymore. There used to be one, but we moved the
data around and reformatted the XFS partition as ext3. However, the machine has
not been rebooted since the last XFS partition was unmounted.
The server mostly serves mirrors for various projects. We usually have over
150MBit/s on our network interface and not much else is happening.
We will probably reboot the server today with 2.6.14-1.1644_FC4smp and see if
this happens again.
As every lockup happened when nobody was around, it is hard to do alt-sysrq-t.
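
One possible way to get a task dump without anybody at the console is to watch the kernel log and write `t` to /proc/sysrq-trigger when a lockup line shows up, which requests the same dump as alt-sysrq-t. This is a minimal sketch with several assumptions: it must run as root, the names `maybe_dump_tasks` and `watch` are invented here, and it can only help if userspace is still being scheduled. If the machine freezes completely, as in the original report, the script will not get to run.

```python
LOCKUP_MARKER = "lockup on CPU#"


def maybe_dump_tasks(log_line, trigger_path="/proc/sysrq-trigger"):
    """Request a sysrq 't' task dump if log_line reports a lock lockup.

    Returns True when a dump was requested, False otherwise.
    """
    if LOCKUP_MARKER not in log_line:
        return False
    with open(trigger_path, "w") as trigger:
        trigger.write("t")
    return True


def watch(stream, trigger_path="/proc/sysrq-trigger"):
    """Read log lines from stream and stop after the first dump request."""
    for line in stream:
        if maybe_dump_tasks(line, trigger_path):
            return True  # one dump is enough; avoid flooding the log
    return False
```

A hypothetical invocation would feed it the live log, for example by piping `tail -F /var/log/messages` into a small wrapper that calls `watch(sys.stdin)`; the log path is an assumption about the syslog setup.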

Comment 8 Stephen Tweedie 2005-12-13 18:15:54 UTC

OK, thanks; let me know how that kernel goes.  

Did this behaviour suddenly start with one particular kernel?

Comment 9 Adrian Reber 2005-12-13 18:46:21 UTC

We have rebooted and 2.6.14-1.1644_FC4smp is now running; I will post here if we
see this bug again.

This has only happened with 2.6.14-1.1637_FC4smp and was never seen before.

Comment 10 Adrian Reber 2006-01-09 16:18:08 UTC

Created attachment 122954
output of dmesg

The same bug has happened again. See attachment.

We are currently running 2.6.14-1.1644_FC4smp with the following modules loaded:

Module			Size  Used by
loop		       20937  0 
ipv6		      271009  396 
ipt_REJECT		9921  3 
iptable_filter		7105  1 
ip_tables	       25665  2 ipt_REJECT,iptable_filter
dm_mod		       61533  0 
video		       20293  0 
button		       10705  0 
battery 	       13509  0 
ac			8901  0 
ohci_hcd	       26721  0 
cfi_probe	       10049  0 
gen_probe		7617  1 cfi_probe
scb2_flash		8781  0 
mtdcore 	       11913  1 scb2_flash
chipreg 		7489  2 cfi_probe,scb2_flash
map_funcs		5953  1 scb2_flash
e100		       41281  0 
mii			9409  1 e100
e1000		      108077  0 
floppy		       66181  0 
qla2300 	      128705  0 
qla2xxx 	      129053  2 qla2300
scsi_transport_fc      32321  1 qla2xxx
ext3		      135753  11 
jbd		       62037  1 ext3
aic7xxx 	      154741  0 
scsi_transport_spi     25153  1 aic7xxx
i2o_block	       17485  13 
i2o_core	       47721  1 i2o_block
sd_mod		       22849  2 
scsi_mod	      140009  5 qla2xxx,scsi_transport_fc,aic7xxx,scsi_transport_spi,sd_mod

Comment 11 Dave Jones 2006-02-03 05:57:36 UTC

This is a mass-update to all currently open kernel bugs.

A new kernel update has been released (Version: 2.6.15-1.1830_FC4)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO_REPORTER state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks' time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

Thank you.

Comment 12 John Thacker 2006-05-05 21:17:43 UTC

Closing due to previous comment, and note by reporter that
2.6.14-1.1644_FC4smp solved it.
