Description of problem: Attempting to run amcheck on fenlason-rhide.lab.boston.redhat.com triggers a kernel BUG. Subsequent attempts hang. Version-Release number of selected component (if applicable): 2.6.15-1.1909_FC5smp How reproducible: Always Steps to Reproduce: 1. reboot fenlason-rhide.lab.boston.redhat.com 2. run "amcheck JaysCube" as root 3. Look at dmesg, syslog output 4. run "amcheck JaysCube" again. 5. Notice that it hangs. Actual results: eth0: no IPv6 routers present target1:0:1: FAST-20 WIDE SCSI 40.0 MB/s ST (50 ns, offset 15) st0: Block limits 2 - 16777215 bytes. kfree_debugcheck: out of range ptr fffffffch. ------------[ cut here ]------------ kernel BUG at mm/slab.c:2436! invalid opcode: 0000 [#1] SMP last sysfs file: /block/sda/sda1/size Modules linked in: ipv6 ppdev autofs4 rfcomm l2cap bluetooth sunrpc lp parport_pc parport floppy nvram uhci_hcd snd_cmipci gameport snd_seq_dummy matroxfb_base snd_seq_oss matroxfb_DAC1064 matroxfb_accel snd_seq_midi_event matroxfb_Ti3026 matroxfb_g450 snd_seq g450_pll matroxfb_misc snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc snd_opl3_lib snd_timer snd_hwdep st snd_mpu401_uart snd_rawmidi snd_seq_device snd matrox_w1 soundcore wire e100 i2c_piix4 i2c_core mii dm_snapshot dm_zero dm_mirror dm_mod ext3 jbd aic7xxx scsi_transport_spi sd_mod scsi_mod CPU: 0 EIP: 0060:[<c015f0f6>] Not tainted VLI EFLAGS: 00010092 (2.6.15-1.1909_FC5smp) EIP is at kfree_debugcheck+0x20/0x51 eax: 00000031 ebx: 0003ffff ecx: 00000000 edx: 00000082 esi: fffffffc edi: fffffffc ebp: 00000246 esp: d4a59df8 ds: 007b es: 007b ss: 0068 Process amcheck (pid: 2680, threadinfo=d4a59000 task=d4ad18d0) Stack: <0>c0316ea4 fffffffc 00000000 dffdbb00 c015f84c c0149ade 00000000 dffdbb00 dffd7b70 00000246 c016134f dffdc6f8 c172419c 00000000 d49e2a00 c0149ade dcd38310 c172419c dcd38310 c0167ffd 00000a00 00000000 c0167eb8 e0840356 Call Trace: [<c015f84c>] cache_free_debugcheck+0x1b/0x1b9 [<c0149ade>] mempool_free+0x5f/0x63 [<c016134f>] kmem_cache_free+0x2a/0x5c [<c0149ade>] mempool_free+0x5f/0x63 [<c0167ffd>] bio_free+0x25/0x30 [<c0167eb8>] bio_put+0x28/0x29 [<e0840356>] scsi_execute_async+0x13f/0x317 [scsi_mod] [<e094d677>] st_do_scsi+0x1d3/0x228 [st] [<e094f4f9>] st_sleep_done+0x0/0x3a [st] [<e0951439>] st_read+0x2fb/0x78b [st] [<e095113e>] st_read+0x0/0x78b [st] [<c0163f8d>] vfs_read+0x9f/0x13e [<c01643d9>] sys_read+0x3c/0x63 [<c0103d21>] syscall_call+0x7/0xb Code: f0 09 03 6e 31 c0 58 5b 5e 5f c3 56 53 89 c6 8d 98 00 00 00 40 c1 eb 0c 3b 1d 80 43 48 c0 72 15 50 68 a4 6e 31 c0 e8 e1 48 fc ff <0f> 0b 84 09 03 6e 31 c0 58 5a 6b c3 28 03 05 90 43 48 c0 8b 00 Continuing in 1 seconds. <3>Debug: sleeping function called from invalid context at include/linux/rwsem.h:43 in_atomic():0, irqs_disabled():1 [<c01246a1>] profile_task_exit+0x13/0x48 [<c0125f66>] do_exit+0x1c/0x6fc [<c01239eb>] printk+0x14/0x18 [<c0105272>] show_stack+0x0/0xa [<c0105868>] do_invalid_op+0x0/0x9d [<c01058f9>] do_invalid_op+0x91/0x9d [<c015f0f6>] kfree_debugcheck+0x20/0x51 [<c012339d>] release_console_sem+0x199/0x1a1 [<c0123843>] vprintk+0x195/0x329 [<c0149b19>] mempool_alloc+0x37/0xd3 [<c01d2975>] cfq_set_request+0x318/0x43a [<c01048bf>] error_code+0x4f/0x54 [<c015f0f6>] kfree_debugcheck+0x20/0x51 [<c015f84c>] cache_free_debugcheck+0x1b/0x1b9 [<c0149ade>] mempool_free+0x5f/0x63 [<c016134f>] kmem_cache_free+0x2a/0x5c [<c0149ade>] mempool_free+0x5f/0x63 [<c0167ffd>] bio_free+0x25/0x30 [<c0167eb8>] bio_put+0x28/0x29 [<e0840356>] scsi_execute_async+0x13f/0x317 [scsi_mod] [<e094d677>] st_do_scsi+0x1d3/0x228 [st] [<e094f4f9>] st_sleep_done+0x0/0x3a [st] [<e0951439>] st_read+0x2fb/0x78b [st] [<e095113e>] st_read+0x0/0x78b [st] [<c0163f8d>] vfs_read+0x9f/0x13e [<c01643d9>] sys_read+0x3c/0x63 [<c0103d21>] syscall_call+0x7/0xb Expected results: No kernel BUG. No hang. Additional info: Ping me if you need the root password
Also fails with 2.6.15-1.1914_FC5smp
With 2.6.15-1.1915_FC5smp the second amcheck gets EBUSY instead of hanging. I guess that's progress. :-)
Jay, can you try 2.6.15-1.1926 (or higher) ?
Same problem with 2.6.15-1.1926_FC5smp
Still fails with 2.6.15-1.1939_FC5smp
I wonder if the scsi_execute_in_process_context() changes in rc4 fix this. Can you try 1963 or higher ?
actually, looking again, this looks like a double free, not just a sleeping whilst atomic problem. bad eyes on my part.
2.6.15-1.1991_FC5smp still fails st0: Block limits 2 - 16777215 bytes. kfree_debugcheck: out of range ptr fffffffch. ------------[ cut here ]------------ kernel BUG at mm/slab.c:2490! invalid opcode: 0000 [#1] SMP last sysfs file: /block/sda/sda1/size Modules linked in: nfsd exportfs lockd nfs_acl ipv6 ppdev autofs4 rfcomm l2cap b luetooth sunrpc lp parport_pc parport floppy nvram uhci_hcd st sg snd_cmipci gam eport snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer _oss snd_pcm snd_page_alloc matroxfb_base snd_opl3_lib matroxfb_DAC1064 snd_time r matroxfb_accel snd_hwdep matroxfb_Ti3026 snd_mpu401_uart matroxfb_g450 snd_raw midi g450_pll snd_seq_device snd matroxfb_misc soundcore matrox_w1 wire e100 mii i2c_piix4 i2c_core dm_snapshot dm_zero dm_mirror dm_mod ext3 jbd aic7xxx scsi_t ransport_spi sd_mod scsi_mod CPU: 0 EIP: 0060:[<c015f311>] Not tainted VLI EFLAGS: 00010086 (2.6.15-1.1991_FC5smp #1) EIP is at kfree_debugcheck+0x20/0x51 eax: 00000031 ebx: 0003ffff ecx: d4018dfc edx: c0315f6d esi: fffffffc edi: fffffffc ebp: 00000246 esp: d4018df8 ds: 007b es: 007b ss: 0068 Process amcheck (pid: 2568, threadinfo=d4018000 task=d46d03b0) Stack: <0>c0315f6d fffffffc 00000000 dffdeb00 c015fa51 c0149d2e 00000000 dffdeb0 0 dffdcb70 00000246 c0161559 dfd6c9d0 c1770170 00000000 de613598 c0149d2e d463cea0 c1770170 d463cea0 c016822d 00000a00 00000000 c01680e8 e0842418 Call Trace: [<c015fa51>] cache_free_debugcheck+0x1b/0x1b9 [<c0149d2e>] mempool_free+0x5 f/0x63 [<c0161559>] kmem_cache_free+0x2a/0x5c [<c0149d2e>] mempool_free+0x5f/0x63 [<c016822d>] bio_free+0x25/0x30 [<c01680e8>] bio_put+0x28/0x29 [<e0842418>] scsi_execute_async+0x13f/0x317 [scsi_mod] [<e0a38677>] st_do_s csi+0x1d3/0x228 [st] [<e0a3a4f9>] st_sleep_done+0x0/0x3a [st] [<e0a3c439>] st_read+0x2fb/0x78b [ st] [<e0a3c13e>] st_read+0x0/0x78b [st] [<c01641c9>] vfs_read+0x9f/0x13e [<c0164615>] sys_read+0x3c/0x63 [<c0103d81>] syscall_call+0x7/0xb Code: 26 0a cc 5e 31 c0 58 5b 5e 5f c3 56 53 89 c6 8d 98 00 00 00 40 c1 eb 0c 3b 1d 00 23 48 c0 72 15 50 68 6d 5f 31 c0 e8 16 48 fc ff <0f> 0b ba 09 cc 5e 31 c0 58 5a 6b c3 28 03 05 10 23 48 c0 8b 00 Continuing in 1 seconds. <3>Debug: sleeping function called from invalid context at include/linux/rwsem. h:43 in_atomic():0, irqs_disabled():1 [<c0124881>] profile_task_exit+0x13/0x48 [<c01261a3>] do_exit+0x1c/0x6fc [<c010526c>] register_die_notifier+0x0/0x2f [<c010584e>] do_invalid_op+0x0/ 0x9d [<c01058df>] do_invalid_op+0x91/0x9d [<c015f311>] kfree_debugcheck+0x20/0x5 1 [<c0123993>] vprintk+0x195/0x329 [<c0149d69>] mempool_alloc+0x37/0xd3 [<c0160050>] kmem_cache_alloc+0x7f/0x89 [<c0149d69>] mempool_alloc+0x37/0xd 3 [<c0149d69>] mempool_alloc+0x37/0xd3 [<c01d2d5a>] cfq_set_request+0x318/0x4 3a [<c01048a7>] error_code+0x4f/0x54 [<c015f311>] kfree_debugcheck+0x20/0x51 [<c015fa51>] cache_free_debugcheck+0x1b/0x1b9 [<c0149d2e>] mempool_free+0x5 f/0x63 [<c0161559>] kmem_cache_free+0x2a/0x5c [<c0149d2e>] mempool_free+0x5f/0x63 [<c016822d>] bio_free+0x25/0x30 [<c01680e8>] bio_put+0x28/0x29 [<e0842418>] scsi_execute_async+0x13f/0x317 [scsi_mod] [<e0a38677>] st_do_s csi+0x1d3/0x228 [st] [<e0a3a4f9>] st_sleep_done+0x0/0x3a [st] [<e0a3c439>] st_read+0x2fb/0x78b [ st] [<e0a3c13e>] st_read+0x0/0x78b [st] [<c01641c9>] vfs_read+0x9f/0x13e [<c0164615>] sys_read+0x3c/0x63 [<c0103d81>] syscall_call+0x7/0xb
With 2.6.15-1.2041_FC5smp the behavior has changed. Instead of BUGging immediately in amcheck, the kernel return -EBUSY on an attempt to read the tape drive, then crashes some time later. I'm attaching the strace from amcheck. At the time of the crash, the serial console said: Unable to handle kernel NULL pointer dereference at virtual address 00 000014 printing eip: c015e8f4 *pde = 1e46f001 Oops: 0000 [#1] SMP last sysfs file: /block/sda/sda1/size Modules linked in: nfsd exportfs lockd nfs_acl ipv6 ppdev autofs4 hidp rfcomm l2 cap bluetooth sunrpc lp parport_pc parport floppy nvram uhci_hcd st sg snd_cmipc i gameport snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_ mixer_oss snd_pcm snd_page_alloc snd_opl3_lib snd_timer snd_hwdep snd_mpu401_uar t snd_rawmidi snd_seq_device snd soundcore matroxfb_base matroxfb_DAC1064 matrox fb_accel matroxfb_Ti3026 matroxfb_g450 g450_pll matroxfb_misc matrox_w1 wire i2c _piix4 i2c_core e100 mii dm_snapshot dm_zero dm_mirror dm_mod ext3 jbd aic7xxx s csi_transport_spi sd_mod scsi_mod CPU: 1 EIP: 0060:[<c015e8f4>] Not tainted VLI EFLAGS: 00010016 (2.6.15-1.2041_FC5smp #1) EIP is at free_block+0x40/0xe5 eax: 00a00000 ebx: dffd5f20 ecx: 00000010 edx: 00000000 esi: 00000000 edi: dffdeb40 ebp: 00000005 esp: dffeeef8 ds: 007b es: 007b ss: 0068 Process events/1 (pid: 9, threadinfo=dffee000 task=c1700050) Stack: <0>c1329678 0000001e dffd6820 dffd6820 0000001e dffd6800 00000000 c015e9f 5 00000000 dffdeb40 dffd5eb0 dffd5f20 dffdeb40 00000000 c015fe6b 00000000 dffd5f4c dffdec14 c1511280 c1511284 c16da120 00000282 c0131488 c015fdce Call Trace: [<c015e9f5>] drain_array_locked+0x5c/0x7b [<c015fe6b>] cache_reap+0x9d/0x19 1 [<c0131488>] run_workqueue+0x7f/0xba [<c015fdce>] cache_reap+0x0/0x191 [<c0131c75>] worker_thread+0x0/0x117 [<c0131d5b>] worker_thread+0xe6/0x117 [<c011da07>] default_wake_function+0x0/0xc [<c0134589>] kthread+0x9d/0xc9 [<c01344ec>] kthread+0x0/0xc9 [<c0102005>] kernel_thread_helper+0x5/0xb Code: 00 00 00 8b 44 24 08 8b 34 a8 8d 86 00 00 00 40 c1 e8 0c 6b c0 28 8b 15 10 f3 47 c0 8b 4c 10 24 8b 54 24 20 8b 9c 97 90 00 00 00 <8b> 51 04 39 0a 74 08 0f 0b a7 00 03 fb 30 c0 8b 01 39 48 04 74 Continuing in 1 seconds. <3>Debug: sleeping function called from invalid context at include/linux/rwsem. h:43 in_atomic():0, irqs_disabled():1 [<c0124c15>] profile_task_exit+0x13/0x48 [<c0126537>] do_exit+0x1c/0x6f5 [<c010526c>] register_die_notifier+0x0/0x2f [<c02f32e0>] do_page_fault+0x3b 1/0x5e2 [<c015e8f4>] free_block+0x40/0xe5 [<c02f2f2f>] do_page_fault+0x0/0x5e2 [<c01048a7>] error_code+0x4f/0x54 [<c015e8f4>] free_block+0x40/0xe5 [<c015e9f5>] drain_array_locked+0x5c/0x7b [<c015fe6b>] cache_reap+0x9d/0x19 1 [<c0131488>] run_workqueue+0x7f/0xba [<c015fdce>] cache_reap+0x0/0x191 [<c0131c75>] worker_thread+0x0/0x117 [<c0131d5b>] worker_thread+0xe6/0x117 [<c011da07>] default_wake_function+0x0/0xc [<c0134589>] kthread+0x9d/0xc9 [<c01344ec>] kthread+0x0/0xc9 [<c0102005>] kernel_thread_helper+0x5/0xb BUG: spinlock lockup on CPU#0, syslogd/1704, dffd5f4c (Not tainted) [<c01d6723>] _raw_spin_lock+0xb9/0xd7 [<c015ec35>] cache_alloc_refill+0x63/ 0x489 [<c015ebc8>] kmem_cache_alloc+0x50/0x5a [<c01490d1>] mempool_alloc+0x37/0xd 3 [<c01664f5>] bio_alloc_bioset+0x9b/0xf3 [<c0166558>] bio_alloc+0xb/0x17 [<c01634f5>] submit_bh+0x78/0x102 [<c0165e8f>] __block_write_full_page+0x1f 3/0x2fa [<e090aec9>] ext3_get_block+0x0/0x6c [ext3] [<c016605c>] block_write_full_p age+0xc6/0xcf [<e090aec9>] ext3_get_block+0x0/0x6c [ext3] [<e090c569>] ext3_ordered_write page+0xd0/0x162 [ext3] [<e0909c3a>] bget_one+0x0/0x7 [ext3] [<c0180fcd>] mpage_writepages+0x1a1/0x 2fd [<e090c499>] ext3_ordered_writepage+0x0/0x162 [ext3] [<c014b7b4>] do_writep ages+0x2b/0x32 [<c01473aa>] __filemap_fdatawrite_range+0x65/0x71 [<c01473d8>] filemap_fdat awrite+0xf/0x13 [<c0163f65>] do_fsync+0x48/0xbb [<c0103d81>] syscall_call+0x7/0xb
Created attachment 126066 [details] strace output of amcheck command, showing EBUSY return from tape read
Still broken with 2.6.16-1.2083_FC6smp
Are we still seeing this with T3 ?
Jay, could you check this with the latest rawhide kernel? If it still happens it should probably be a RHEL5 blocker, since quite a few customers have tape drives.
It won't be easy--I updated Amanda, and the new version doesn't use the same access patterns to the tape drive. I'll try reinstalling the old version and see what I can get.
It's no longer reproducing with amanda-2.4.5p1-3.2 or amanda-2.5.0p2-4 on the 2.6.18-1.2726.fc6 kernel. Closing.