Bug 180641

Summary: System hung: module load/unload
Product: Red Hat Enterprise Linux 4 Reporter: James Smart <james.smart>
Component: kernelAssignee: Tom Coughlan <coughlan>
Status: CLOSED WONTFIX QA Contact: Brian Brock <bbrock>
Severity: high Docs Contact:
Priority: medium    
Version: 4.0CC: jbaron
Target Milestone: ---   
Target Release: ---   
Hardware: powerpc   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-06-20 15:55:13 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Attached is the process listing none

Description James Smart 2006-02-09 17:07:57 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.12) Gecko/20050915 Firefox/1.0.7

Description of problem:
The following information was gathered on a ppc64 machine running the 2.6.9-29 RHEL4 kernel.  Same problems have also been seen on the 2.6.9-27 kernel.

Run the following script:

while true
do
rmmod lpfc
sleep 5
modprobe lpfc
sleep 5
done

Somewhere between 90 and 140 iterations the system hangs.  Cannot perform any operations on the machine.  Hitting the reset button on the host drops us into xmon.  The 2 processors have the following stacks:

0:mon> t
[c00000000fffbe10] d00000000019ec3c .pcnet32_interrupt+0x9c/0x760 [pcnet32]
[c00000000fffbef0] c000000000013798 .handle_irq_event+0x80/0xf8
[c00000000fffbf90] c000000000018634 .call_handle_irq_event+0x14/0x24
[c00000002db877d0] c000000000013c5c .ppc_irq_dispatch_handler+0x278/0x398
[c00000002db87890] c000000000013e14 .do_IRQ+0x98/0x100
[c00000002db87910] c00000000000ae34 HardwareInterrupt_entry+0x8/0x54
--- Exception: 501 (Hardware Interrupt) at c0000000002f9900 ._spin_unlock_irq+0x18/0x1c
[link register   ] c0000000002f7c94 .schedule+0xad4/0xb58
[c00000002db87c00] c0000000002f764c .schedule+0x48c/0xb58 (unreliable)
[c00000002db87d10] c000000000059158 .sys_sched_yield+0xa4/0xd0
[c00000002db87db0] c0000000000881a4 .stop_machine+0xfc/0x18c
[c00000002db87e50] c000000000088280 .do_stop+0x1c/0xcc
[c00000002db87ee0] c00000000007bdec .kthread+0xcc/0x11c
[c00000002db87f90] c000000000018cb0 .kernel_thread+0x4c/0x6c
0:mon> c1
1:mon> t
[link register   ] c00000000004bd80 .__ste_allocate+0xc8/0x15c
[c00000003666ae00] c0000000000604f0 .profile_tick+0x70/0xbc (unreliable)
[c00000003666ae80] c00000000000b314 .do_ste_alloc+0x4/0x70
--- Exception: 401 (Instruction Access) at d00000000019a678 .pcnet32_dwio_write_csr+0x48/0x9c [pcnet32]
[link register   ] d00000000019eafc .pcnet32_start_xmit+0x214/0x2b8 [pcnet32]
[c00000003666b170] d00000000019ea48 .pcnet32_start_xmit+0x160/0x2b8 [pcnet32] (unreliable)
[c00000003666b240] c00000000028d2b0 .qdisc_restart+0x1ac/0x32c
[c00000003666b2e0] c0000000002763dc .dev_queue_xmit+0x14c/0x304
[c00000003666b370] c00000000027ece0 .neigh_resolve_output+0x1ec/0x278
--- Exception: 301 (Data Access) at c0000000002a16dc .ip_queue_xmit+0x4d4/0x5d8
[link register   ] c0000000002b57ac .tcp_transmit_skb+0x288/0x8ac
[c00000003666b410] c0000000002a0bc8 .ip_finish_output+0x248/0x2f4 (unreliable)
--- Exception: c00000003666b660  at c00000000fbca360
[link register   ] c00000003666b820
[c00000003666b4d0] c0000000002a16dc .ip_queue_xmit+0x4d4/0x5d8 (unreliable)
[c00000003666b660] c0000000002b5c10 .tcp_transmit_skb+0x6ec/0x8ac
[c00000003666b770] c0000000002b6b68 .tcp_write_xmit+0xc8/0x358
[c00000003666b850] c0000000002a84d4 .tcp_sendmsg+0x10ec/0x1310
[c00000003666b9c0] c0000000002d18b0 .inet_sendmsg+0x80/0xac
[c00000003666ba60] c000000000268c50 .sock_aio_write+0x184/0x1b4
[c00000003666bb90] c0000000000bdd0c .do_sync_write+0xb8/0xfc
[c00000003666bcf0] c0000000000bdeb4 .vfs_write+0x164/0x1ac
[c00000003666bd90] c0000000000bdfd4 .sys_write+0x4c/0x8c
[c00000003666be30] c000000000011280 syscall_exit+0x0/0x18
--- Exception: c01 (System Call) at 0000000007b35168
SP (ffffdaf0) is in userspace

<< I'll try to add the process list shortly. Too much for bugzilla to
<< take at initial creation

Starting with the rmmod process we see __stop_machine_run has called kthread_create to create the "kstopmachine" kernel thread and is now in wait_for_completion waiting for the thread to complete.  The kstopmachine process began execution in the do_stop function and is running on processor 0.  In looking at the stack for processor 0 we see stop_machine has called sys_sched_yield.  This means we have created kernel threads to execute stopmachine on all but the current processor and are now waiting for those to come to life (by way of checking stopmachine_thread_ack != stopmachine_num_threads).  This is where the system seems to be stuck.  We can see a second "kstopmachine" process in the process list.  I assume this is the kernel thread created in stop_machine for the second processor since it was created with kernel_thread(stopmachine, i, CLONE_KERNEL).  It appears processor 1 never began executing the stopmachine function and stopmachine_thread_ack was never incremented to match stopmachine_num_threads.  Processor 1 looks like it's doing network IO.  It looks like it got a segment table fault trying to access the write_csr function pointer.  It goes off and tries to allocate a segment table entry and there it sits.


Version-Release number of selected component (if applicable):
kernel-2.6.9-29 and kernel-2.6.9-27 - on ppc64

How reproducible:
Always

Steps to Reproduce:
1. Run the following script

while true
do
rmmod lpfc
sleep 5
modprobe lpfc
sleep 5
done


Additional info:

Comment 1 James Smart 2006-02-09 17:09:41 UTC
Created attachment 124442 [details]
Attached is the process listing

Comment 2 James Smart 2006-02-10 20:55:50 UTC
Here's another occurance. Happened on the insmod this time...

-- james

====
I saw this hang again only this time it was in insmod.  The common thread
however seems to be with pcnet32 and a segment table fault.  Here are the stacks
for each processor:

0:mon> t
[link register   ] c00000000004bd80 .__ste_allocate+0xc8/0x15c
[c00000000fffba20] 0000000043ecc6d0 (unreliable)
[c00000000fffbaa0] c00000000000b314 .do_ste_alloc+0x4/0x70
--- Exception: 401 (Instruction Access) at d00000000019889c .pcnet32_dwio_read_]
[c00000000fffbe10] d00000000019cc60 .pcnet32_interrupt+0xc0/0x760 [pcnet32]
[c00000000fffbef0] c000000000013798 .handle_irq_event+0x80/0xf8
[c00000000fffbf90] c000000000018634 .call_handle_irq_event+0x14/0x24
[c0000000003a3950] c000000000013c5c .ppc_irq_dispatch_handler+0x278/0x398
[c0000000003a3a10] c000000000013e14 .do_IRQ+0x98/0x100
[c0000000003a3a90] c00000000000ae34 HardwareInterrupt_entry+0x8/0x54
--- Exception: 501 (Hardware Interrupt) at c000000000014550 .default_idle+0x80/4
[c0000000003a3e00] c000000000014954 .cpu_idle+0x38/0x50
[c0000000003a3e70] c00000000000c488 .rest_init+0x78/0x90
[c0000000003a3ef0] c00000000036998c .start_kernel+0x284/0x29c
[c0000000003a3f90] c00000000000c0c8 .__setup_cpu_power3+0x0/0x4
0:mon> c1
1:mon> t
[c00000000fff7da0] d00000000019e258 .pcnet32_watchdog+0x44/0xac [pcnet32]
[c00000000fff7e30] c00000000006bdf8 .run_timer_softirq+0x1c8/0x230
[c00000000fff7ef0] c000000000065704 .__do_softirq+0xa0/0x17c
[c00000000fff7f90] c000000000018610 .call_do_softirq+0x14/0x24
[c00000003486a730] c0000000000144a8 .do_softirq+0x74/0x9c
[c00000003486a7c0] c0000000000153e0 .timer_interrupt+0x344/0x374
[c00000003486a8a0] c00000000000a2b4 Decrementer_common+0xb4/0x100
--- Exception: 901 (Decrementer) at c0000000001d1590 .serial_in+0x154/0x190
[c00000003486ac10] c0000000001d47b4 .serial8250_console_write+0xa0/0x360
[c00000003486acc0] c00000000005ee78 .__call_console_drivers+0x94/0xc0
[c00000003486ad50] c00000000005f0a0 .call_console_drivers+0x120/0x16c
[c00000003486adf0] c00000000005f58c .release_console_sem+0x80/0x13c
[c00000003486ae90] c00000000005f40c .vprintk+0x1cc/0x208
[c00000003486af20] c00000000005f22c .printk+0x30/0x44
[c00000003486afa0] d00000000002d4ec .sd_probe+0x344/0x394 [sd_mod]
[c00000003486b050] c0000000001d7d90 .bus_match+0x94/0xd8
[c00000003486b0e0] c0000000001d7e4c .device_attach+0x78/0xf8
[c00000003486b170] c0000000001d8318 .bus_add_device+0xc0/0x138
[c00000003486b210] c0000000001d6384 .device_add+0xa4/0x1ec
[c00000003486b2b0] d00000000006ea8c .scsi_sysfs_add_sdev+0x17c/0x38c [scsi_mod]
[c00000003486b360] d00000000006cb24 .scsi_add_lun+0x388/0x3b4 [scsi_mod]
[c00000003486b3f0] d00000000006cd94 .scsi_probe_and_add_lun+0x194/0x250 [scsi_m]
[c00000003486b4c0] d00000000006d418 .scsi_report_lun_scan+0x38c/0x40c [scsi_mod]
[c00000003486b5d0] d00000000006d7c4 .scsi_scan_target+0xb4/0x114 [scsi_mod]
[c00000003486b680] d00000000006d89c .scsi_scan_channel+0x78/0xd8 [scsi_mod]
[c00000003486b720] d00000000006da24 .scsi_scan_host_selected+0x128/0x1c0 [scsi_]
[c00000003486b7d0] d000000000106224 .lpfc_pci_probe_one+0x5a8/0x700 [lpfc]
[c00000003486b890] c000000000179714 .pci_device_probe_static+0x6c/0xa8
[c00000003486b920] c000000000179794 .__pci_device_probe+0x44/0x7c
[c00000003486b9b0] c000000000179808 .pci_device_probe+0x3c/0x6c
[c00000003486ba40] c0000000001d7d90 .bus_match+0x94/0xd8
[c00000003486bad0] c0000000001d7f3c .driver_attach+0x70/0xe4
[c00000003486bb60] c0000000001d8750 .bus_add_driver+0xf4/0x158
[c00000003486bc00] c0000000001d8e4c .driver_register+0x38/0x4c
[c00000003486bc80] c000000000179c3c .pci_register_driver+0x80/0xcc
[c00000003486bd10] d0000000001066b4 .lpfc_init+0x54/0x84 [lpfc]
[c00000003486bd90] c000000000082b48 .sys_init_module+0x24c/0x4b4
[c00000003486be30] c000000000011280 syscall_exit+0x0/0x18
--- Exception: c01 (System Call) at 000000000ff32190
SP (ffffe510) is in userspace

cpu1 is doing the insmod.  While this is going on cpu0 begins processing a
hardware interrupt for pcnet32.  pcnet32_interrupt will acquire a local spin
lock and then takes a segment table fault.  While all this is going on cpu1
takes a timer interrupt and begins processing pcnet32_watchdog.  This will block
trying to acquire the same spin lock that cpu0 acquired in pcnet32_interrupt and
still owns.  At this point we appear to be deadlocked.

If you look at the rmmod hang below it's very similiar.  cpu1 took a segment
table fault while in pcnet32_start_xmit and cpu0 is blocked in pcnet32_interrupt
trying to acquire the spin lock that cpu1 owns.

At this point I'm suspecting a problem with pcnet32.

Comment 3 Tom Coughlan 2006-06-02 17:18:47 UTC
David, does this look familliar at all? Any chance of a fix in more recent kernels?

Comment 4 Jiri Pallich 2012-06-20 15:55:13 UTC
Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested us to review is now End of Life. 
Please See https://access.redhat.com/support/policy/updates/errata/

If you would like Red Hat to re-consider your feature request for an active release, please re-open the request via appropriate support channels and provide additional supporting details about the importance of this issue.