Bug 194211

Summary: NFS file write causes Kernel BUG at lib/radix-tree.c:378
Product: [Fedora] Fedora Reporter: Matt Amos <matthew.amos>
Component: kernelAssignee: Dave Jones <davej>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Brian Brock <bbrock>
Severity: medium Docs Contact:
Priority: medium    
Version: 5CC: pfrields, wtogami
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2006-11-24 22:48:53 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Matt Amos 2006-06-06 08:56:39 UTC
Description of problem: 
There was a bug in one of our codes that was causing multiple processes to write
the same file on an NFS mount. This went undetected on our x86, mips, sun and
alpha machines, but managed to crash our new x86_64 machine. When the code is
run the machine locks up totally, not responsive to network or local inputs and
needs to be rebooted.

Version-Release number of selected component (if applicable): 
Initially 2.6.15-1.2054_FC5, but upgraded to 2.6.16-1.2122_FC5. Unfortunately,
it didn't help.

How reproducible: 
Run our code. I've attempted to capture the behaviour in a smaller program (ours
is ~20 MLOC), but without success so far.

Steps to Reproduce:
1. Boot server.
2. Run code.
  
Actual results:
When booting with console=ttyS0, the following output is observed:

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lib/radix-tree.c:378
invalid opcode: 0000 [1] SMP 
last sysfs file: /block/hda/removable
CPU 2 
Modules linked in: nfs lockd nfs_acl ppdev autofs4 ipv6 hidp rfcomm l2cap
bluetooth sunrpc video button battery ac lp parport_pc parport floppy nvram
snd_intel8x0 snd_ac97_codec snd_ac97_bus ohci_hcd snd_seq_dummy i2c_nforce2
snd_seq_oss snd_seq_midi_event snd_seq ohci1394 ehci_hcd snd_seq_device ieee1394
i2c_core snd_pcm_oss snd_mixer_oss sg forcedeth snd_pcm snd_timer snd soundcore
snd_page_alloc dm_snapshot dm_zero dm_mirror dm_mod ext3 jbd sata_nv libata
mptspi mptscsih sd_mod scsi_mod mptbase
Pid: 3267, comm: solar_debug Not tainted 2.6.16-1.2122_FC5 #1
RIP: 0010:[<ffffffff80200d07>] <ffffffff80200d07>{radix_tree_tag_set+91}
RSP: 0018:ffff810142b39bc0  EFLAGS: 00010046
RAX: 0000000000000005 RBX: ffff810006377040 RCX: ffff81013fe5b690
RDX: 0000000000000008 RSI: 0000000000000005 RDI: 0000000000000005
RBP: ffff81013fe18dc8 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000282 R14: 0000000000000000 R15: 0000000000000000
FS:  00002aaaab65f4e0(0000) GS:ffff810154ce5c40(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaaad4e1008 CR3: 00000001429a4000 CR4: 00000000000006e0
Process solar_debug (pid: 3267, threadinfo ffff810142b38000, task ffff81014430e7a0)
Stack: ffffffff8016163d 0000000000000050 ffff8101440dd840 ffff8102496d37c0 
       ffff81024e9a59c0 0000000000000000 ffffffff884cd7f4 ffff81013fe18dc8 
       00000004884cdf32 0000000000000001 
Call Trace: <ffffffff8016163d>{test_set_page_writeback+115}
       <ffffffff884cd7f4>{:nfs:nfs_flush_inode+1104}
<ffffffff884ce4f7>{:nfs:nfs_writepages+97}
       <ffffffff80160f23>{do_writepages+37}
<ffffffff8015c106>{__filemap_fdatawrite_range+114}
       <ffffffff8015c2fd>{filemap_write_and_wait+25}
<ffffffff884c77f2>{:nfs:nfs_sync_mapping+39}
       <ffffffff884c784a>{:nfs:nfs_revalidate_mapping+54}
<ffffffff884c5c70>{:nfs:nfs_file_write+153}
       <ffffffff8017e556>{do_sync_write+199}
<ffffffff80145ebe>{autoremove_wake_function+0}
       <ffffffff8017ee50>{vfs_write+206} <ffffffff8017f402>{sys_write+69}
       <ffffffff8010a7ba>{system_call+126}

Code: 0f 0b 68 16 fd 36 80 c2 7a 01 41 83 ea 06 41 ff c9 45 85 c9 
RIP <ffffffff80200d07>{radix_tree_tag_set+91} RSP <ffff810142b39bc0>
 <3>Debug: sleeping function called from invalid context at include/linux/rwsem.h:43
in_atomic():0, irqs_disabled():1

Call Trace: <ffffffff80133beb>{profile_task_exit+21}
       <ffffffff801358ab>{do_exit+32} <ffffffff8010c1ec>{kernel_math_error+0}
       <ffffffff8010c779>{do_invalid_op+163}
<ffffffff80200d07>{radix_tree_tag_set+91}
       <ffffffff884ca362>{:nfs:nfs_create_request+223}
<ffffffff80202e0e>{find_next_bit+89}
       <ffffffff8012a07a>{find_busiest_group+329} <ffffffff8010b6f9>{error_exit+0}
       <ffffffff80200d07>{radix_tree_tag_set+91}
<ffffffff8016163d>{test_set_page_writeback+115}
       <ffffffff884cd7f4>{:nfs:nfs_flush_inode+1104}
<ffffffff884ce4f7>{:nfs:nfs_writepages+97}
       <ffffffff80160f23>{do_writepages+37}
<ffffffff8015c106>{__filemap_fdatawrite_range+114}
       <ffffffff8015c2fd>{filemap_write_and_wait+25}
<ffffffff884c77f2>{:nfs:nfs_sync_mapping+39}
       <ffffffff884c784a>{:nfs:nfs_revalidate_mapping+54}
<ffffffff884c5c70>{:nfs:nfs_file_write+153}
       <ffffffff8017e556>{do_sync_write+199}
<ffffffff80145ebe>{autoremove_wake_function+0}
       <ffffffff8017ee50>{vfs_write+206} <ffffffff8017f402>{sys_write+69}
       <ffffffff8010a7ba>{system_call+126}
NMI Watchdog detected LOCKUP on CPU 0
CPU 0 
Modules linked in: nfs lockd nfs_acl ppdev autofs4 ipv6 hidp rfcomm l2cap
bluetooth sunrpc video button battery ac lp parport_pc parport floppy nvram
snd_intel8x0 snd_ac97_codec snd_ac97_bus ohci_hcd snd_seq_dummy i2c_nforce2
snd_seq_oss snd_seq_midi_event snd_seq ohci1394 ehci_hcd snd_seq_device ieee1394
i2c_core snd_pcm_oss snd_mixer_oss sg forcedeth snd_pcm snd_timer snd soundcore
snd_page_alloc dm_snapshot dm_zero dm_mirror dm_mod ext3 jbd sata_nv libata
mptspi mptscsih sd_mod scsi_mod mptbase
Pid: 3293, comm: solar_debug Not tainted 2.6.16-1.2122_FC5 #1
RIP: 0010:[<ffffffff80204b75>] <ffffffff80204b75>{_raw_write_lock+148}
RSP: 0018:ffff810142aa5c98  EFLAGS: 00000006
RAX: 0000000077c0d400 RBX: ffff81013fe18de0 RCX: 0000000000000001
RDX: 000000000295e148 RSI: 00000000000000a8 RDI: ffff81013fe18de0
RBP: ffff81013fe18dc8 R08: ffff810142aa5bc8 R09: 0000000000000010
R10: ffff810006377040 R11: ffff810142aa5c78 R12: ffff81013fe18de0
R13: ffff81013fe18dc8 R14: 0000000000000007 R15: ffffffffffffffff
FS:  00002aaaab65f4e0(0000) GS:ffffffff8050c000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000006ea000 CR3: 00000001435bd000 CR4: 00000000000006e0
Process solar_debug (pid: 3293, threadinfo ffff810142aa4000, task ffff810142be0820)
Stack: 0000000000000282 ffffffff8033b3a5 ffff810153e65940 ffffffff801616cc 
       0000000000000000 ffff810153e65940 0000000000000006 0000000000000001 
       ffff81013fe18dc8 ffffffff80162f04 
Call Trace: <ffffffff8033b3a5>{_write_lock_irqsave+9}
       <ffffffff801616cc>{test_clear_page_dirty+56}
<ffffffff80162f04>{invalidate_inode_pages2_range+274}
       <ffffffff884cdc62>{:nfs:nfs_sync_inode+100}
<ffffffff884c7852>{:nfs:nfs_revalidate_mapping+62}
       <ffffffff884c5c70>{:nfs:nfs_file_write+153}
<ffffffff8017e556>{do_sync_write+199}
       <ffffffff80145ebe>{autoremove_wake_function+0}
<ffffffff8017ee50>{vfs_write+206}
       <ffffffff8017f905>{fget_light+24} <ffffffff8017f402>{sys_write+69}
       <ffffffff8010a7ba>{system_call+126}

Code: 48 69 05 70 4f 1b 00 fa 00 00 00 48 39 c2 72 d6 85 c9 74 ce 
console shuts up ...
 <3>Debug: sleeping function called from invalid context at include/linux/rwsem.h:43
in_atomic():1, irqs_disabled():1

Call Trace: <NMI> <ffffffff80133beb>{profile_task_exit+21}
       <ffffffff801358ab>{do_exit+32} <ffffffff8033bb34>{__die+0}
       <ffffffff8033c326>{nmi_watchdog_tick+239}
<ffffffff8033c06d>{default_do_nmi+136}
       <ffffffff8033c401>{do_nmi+69} <ffffffff8033b8bb>{nmi+127}
       <ffffffff80204b75>{_raw_write_lock+148} <EOE>
<ffffffff8033b3a5>{_write_lock_irqsave+9}
       <ffffffff801616cc>{test_clear_page_dirty+56}
<ffffffff80162f04>{invalidate_inode_pages2_range+274}
       <ffffffff884cdc62>{:nfs:nfs_sync_inode+100}
<ffffffff884c7852>{:nfs:nfs_revalidate_mapping+62}
       <ffffffff884c5c70>{:nfs:nfs_file_write+153}
<ffffffff8017e556>{do_sync_write+199}
       <ffffffff80145ebe>{autoremove_wake_function+0}
<ffffffff8017ee50>{vfs_write+206}
       <ffffffff8017f905>{fget_light+24} <ffffffff8017f402>{sys_write+69}
       <ffffffff8010a7ba>{system_call+126}
Kernel panic - not syncing: Aiee, killing interrupt handler!

Call Trace: <NMI> <ffffffff80131fd2>{panic+134}
<ffffffff8033b37e>{_spin_unlock_irq+12}
       <ffffffff8033ad2e>{__down_read+52} <ffffffff8033b3be>{_spin_lock_irqsave+9}
       <ffffffff8020145e>{__up_read+19} <ffffffff80135917>{do_exit+140}
       <ffffffff8033bb34>{__die+0} <ffffffff8033c326>{nmi_watchdog_tick+239}
       <ffffffff8033c06d>{default_do_nmi+136} <ffffffff8033c401>{do_nmi+69}
       <ffffffff8033b8bb>{nmi+127} <ffffffff80204b75>{_raw_write_lock+148} <EOE>
       <ffffffff8033b3a5>{_write_lock_irqsave+9}
<ffffffff801616cc>{test_clear_page_dirty+56}
       <ffffffff80162f04>{invalidate_inode_pages2_range+274}
       <ffffffff884cdc62>{:nfs:nfs_sync_inode+100}
<ffffffff884c7852>{:nfs:nfs_revalidate_mapping+62}
       <ffffffff884c5c70>{:nfs:nfs_file_write+153}
<ffffffff8017e556>{do_sync_write+199}
       <ffffffff80145ebe>{autoremove_wake_function+0}
<ffffffff8017ee50>{vfs_write+206}
       <ffffffff8017f905>{fget_light+24} <ffffffff8017f402>{sys_write+69}
       <ffffffff8010a7ba>{system_call+126}
 <7>APIC error on CPU0: 00(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
... message continues until machine is powered off ...
(NOTE: "solar_debug", is the name of the code executable)

Expected results:
The contents of the file might be undefined, but the kernel should not crash.

Additional info:
Test machine is a 2 processor, 2 core Opteron 270. Install was from FC5 x86_64
"bordeaux" DVD. Output of cpuinfo:
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 33
model name      : Dual Core AMD Opteron(tm) Processor 270
stepping        : 2
cpu MHz         : 2009.269
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3d
now pni lahf_lm cmp_legacy
bogomips        : 4025.06
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

Comment 1 Dave Jones 2006-10-16 23:56:00 UTC
A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem
please check you only have one version of device-mapper & lvm2
installed.  See bug 207474 for further details.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.

Thank you.

Comment 2 Dave Jones 2006-11-24 22:48:53 UTC
This bug has been mass-closed along with all other bugs that
have been in NEEDINFO state for several months.

Due to the large volume of inactive bugs in bugzilla, this
is the only method we have of cleaning out stale bug reports
where the reporter has disappeared.

If you can reproduce this bug after installing all the
current updates, please reopen this bug.

If you are not the reporter, you can add a comment requesting
it be reopened, and someone will get to it asap.

Thank you.