Red Hat Bugzilla – Bug 845143
Soft lockup on Ionics Stratus ARM computer
Last modified: 2013-04-26 23:33:21 EDT
Description of problem: I am attempting to build activeMQ via maven / java, on an ARM-based IONICS Stratus plug computer.
The computer hangs part way into the build, and I get a kernel error message about a soft lockup. It looks like this is originating somewhere in the networking stack:
[ 1069.060253] BUG: soft lockup - CPU#0 stuck for 22s! [java:686]
[ 1069.066116] Modules linked in: nfs nfs_acl auth_rpcgss fscache lockd sunrpc vfat fat mtdchar ofpart libertas_sdio libertas orion_nand cfg80211 rfkill nand nand_ecc mv643xx_eth nand_ids mtd mv_cesa usb_storage mmc_block mvsdio mmc_core [last unloaded: scsi_wait_scan]
[ 1069.091482] Pid: 686, comm: java
[ 1069.096116] CPU: 0 Not tainted (3.4.6-2.fc17.armv5tel.kirkwood #1)
[ 1069.102684] PC is at dma_cache_maint_page+0xb0/0xe0
[ 1069.107583] LR is at dma_cache_maint_page+0xa8/0xe0
[ 1069.112485] pc : [<c000f724>] lr : [<c000f71c>] psr: 80000013
[ 1069.112490] sp : c7475c58 ip : c022eed0 fp : 00000fff00
[ 1069.124026] r10: 000002ec r9 : c066cb88 r8 : 0a23300000
[ 1069.129276] r7 : 00000001 r6 : 00000076 r5 : c0b109c0 r4 : 00000abc
[ 1069.135834] r3 : 00000000 r2 : c066cb88 r1 : 00000076 r0 : dde4eabc
[ 1069.142392] Flags: Nzcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
[ 1069.149559] Control: 0005397f Table: 1dd6c000 DAC: 00000015
[ 1069.155349] [<c000f1f8>] (unwind_backtrace+0x0/0x124) from [<c0077e10>] (watchdog_timer_fn+0xf0/0x144)
[ 1069.164706] [<c0077e10>] (watchdog_timer_fn+0xf0/0x144) from [<c0038ab4>] (__run_hrtimer+0xb0/0x1d4)
[ 1069.173886] [<c0038ab4>] (__run_hrtimer+0xb0/0x1d4) from [<c00392d0>] (hrtimer_interrupt+0x10c/0x254)
[ 1069.183161] [<c00392d0>] (hrtimer_interrupt+0x10c/0x254) from [<c001604c>] (orion_timer_interrupt+0x24/0x34)
[ 1069.193046] [<c001604c>] (orion_timer_interrupt+0x24/0x34) from [<c007859c>] (handle_irq_event_percpu+0x38/0x240)
[ 1069.203360] [<c007859c>] (handle_irq_event_percpu+0x38/0x240) from [<c00787d4>] (handle_irq_event+0x30/0x40)
[ 1069.213240] [<c00787d4>] (handle_irq_event+0x30/0x40) from [<c007ae1c>] (handle_level_irq+0xbc/0xd0)
[ 1069.222421] [<c007ae1c>] (handle_level_irq+0xbc/0xd0) from [<c0077fc8>] (generic_handle_irq+0x28/0x38)
[ 1069.231778] [<c0077fc8>] (generic_handle_irq+0x28/0x38) from [<c0009b84>] (handle_IRQ+0x68/0x8c)
[ 1069.240611] [<c0009b84>] (handle_IRQ+0x68/0x8c) from [<c042c574>] (__irq_svc+0x34/0x80)
[ 1069.248662] [<c042c574>] (__irq_svc+0x34/0x80) from [<c000f724>] (dma_cache_maint_page+0xb0/0xe0)
[ 1069.257583] [<c000f724>] (dma_cache_maint_page+0xb0/0xe0) from [<c000fbdc>] (___dma_page_dev_to_cpu+0x7c/0xcc)
[ 1069.267642] [<c000fbdc>] (___dma_page_dev_to_cpu+0x7c/0xcc) from [<c022cda8>] (dma_async_memcpy_buf_to_pg+0x134/0x1a8)
[ 1069.278398] [<c022cda8>] (dma_async_memcpy_buf_to_pg+0x134/0x1a8) from [<c022e1f8>] (dma_memcpy_to_iovec+0xd4/0x158)
[ 1069.288981] [<c022e1f8>] (dma_memcpy_to_iovec+0xd4/0x158) from [<c036c7e0>] (dma_skb_copy_datagram_iovec+0x5c/0x1d4)
[ 1069.299564] [<c036c7e0>] (dma_skb_copy_datagram_iovec+0x5c/0x1d4) from [<c0392e9c>] (tcp_recvmsg+0x5a8/0x9f4)
[ 1069.309541] [<c0392e9c>] (tcp_recvmsg+0x5a8/0x9f4) from [<c03b0ae0>] (inet_recvmsg+0x48/0x5c)
[ 1069.318122] [<c03b0ae0>] (inet_recvmsg+0x48/0x5c) from [<c0345d60>] (sock_recvmsg+0xc0/0xe8)
[ 1069.326611] [<c0345d60>] (sock_recvmsg+0xc0/0xe8) from [<c03477b0>] (sys_recvfrom+0xa0/0x10c)
[ 1069.335182] [<c03477b0>] (sys_recvfrom+0xa0/0x10c) from [<c0347838>] (sys_recv+0x1c/0x20)
[ 1069.343404] [<c0347838>] (sys_recv+0x1c/0x20) from [<c0008c40>] (ret_fast_syscall+0x0/0x2c)
Version-Release number of selected component (if applicable):
Linux version 3.4.6-2.fc17.armv5tel.kirkwood (firstname.lastname@example.org)
How reproducible: 100%
Steps to Reproduce:
1. Download the source for activemq-parent-5.5.1
2. attempt to build via: mvn clean install
Actual results: soft lock up
Expected results: no lock up
Additional info: I have a 500 GB USB hard-drive connected, providing the root filesystem. The kernel / initrd are loaded over tftp. I'll add an attachment showing the entire console output from bootup to the hang.
Created attachment 601844 [details]
Complete console log showing the lockup
I'll add a "me too" - I have been having this problem (a very similar BUG message) on a sheevaplug. It could be related to the network stack. I run into it when 1) running a python script that streams data over the network 2) accessing saned via xinetd. 3) Other times seemingly at random.
I have encountered this using kernel-kirkwood-3.4.2-3.fc17.armv5tel.rpm, and 3.6.0-0.rc6.git2.1.fc18.armv5tel.kirkwood. (For me, the 3.5.X kernels don't boot due to a different RTC bug). I'll attach an example of the BUG message below.
There is at least one other person running into this bug on a dreamplug: http://forums.fedoraforum.org/showthread.php?t=284430
I have noticed that the RTC is set to a strange value after the soft lockup.
Created attachment 617361 [details]
soft lockup bug console log
Thank you for your bug reports!
Could you please check whether the following patch fixes the issue for you?
Unfortunately, I no longer own the IONICS Stratus, so I cannot try out the proposed fix. However, the patch you mentioned sure seems like a good candidate.
Matt - you had a similar problem. Are you in a position to try the fix?
I will try this with my sheevaplug - it will be a couple of weeks before I can get to it though. I'm in the middle of a huge deadline. Sorry!
Actually, if you have a compiled kernel rpm package I can try it sooner, but I don't have an ARM build setup going right now.
Matt, this is what I use on my Guruplug currently:
Hope it helps!
Lubomir, does that kernel have the patch you suggested applied? I have installed it on my sheevaplug, and all seems well so far. I'll report back if any soft lockups happen.
Just a followup - looks good so far. Haven't had any lockups under the circumstances that previously caused them. However, I do get messages like these in the messages file:
[92579.500850] mv_xor mv_xor.0: mv_xor_clean_completed_slots 362
[92579.500914] mv_xor mv_xor.0: mv_xor_prep_dma_memcpy dest: 18ad9960 src 1d0cb084 len: 258 flags: 18
[92579.500936] mv_xor mv_xor.0: mv_xor_prep_dma_memcpy sw_desc de692000 async_tx de69203c
[92579.500951] mv_xor mv_xor.0: mv_xor_tx_submit sw_desc de692000: async_tx de69203c
[92579.500968] mv_xor mv_xor.0: mv_xor_start_new_chain 270: sw_desc de692000
[92579.500980] mv_xor mv_xor.0: activate chan.
[92579.501000] mv_xor mv_xor.0: intr cause 3
[92579.501014] mv_xor mv_xor.0: mv_xor_device_clear_eoc_cause, val 0xfffffffe
[92579.501036] mv_xor mv_xor.0: __mv_xor_slot_cleanup 402
[92579.501050] mv_xor mv_xor.0: current_desc 7844d40
[92579.501062] mv_xor mv_xor.0: mv_xor_clean_completed_slots 362
[92579.501097] mv_xor mv_xor.0: mv_xor_clean_slot 379: desc de692000 flags 18
[92579.501113] mv_xor mv_xor.0: mv_xor_free_slots 255 slot de692000
[92579.501164] mv_xor mv_xor.0: mv_xor_clean_completed_slots 362
Thank you for your testing, Matt.
What you see is expected -- it's based off Rawhide kernel that has tons of debugging and sanity checks enabled (this is from CONFIG_DMA_DEBUG or something like that), equivalent to kernel-debug from release builds. It is probably a bit slower than release kernel as well, thus if you intend to use this, you may want to reapply the patch to a release kernel instead.
Have a nice 2013!
As mentioned on the linux-arm-kernel mailing list, removing these checks is wrong. The hardware imposes a minimum buffer size of 16 bytes. In practice, any DMA operation less than 128 bytes are rejected, because software will be faster than setting up the DMA, handling the interrupt when its finished, scheduling and running a tasklet, and running the completion callback etc.
The crypto/async_tx/async_memcpy.c:async_memcpy() functions has no issues with this minimum size limit, it falls back to memcpy().
Have you tried the following. On low memory systems it basically keeps a little free for things like network buffers. We've seen it helps for some other ARM platforms.
cat << EOF > /etc/sysctl.conf
vm.min_free_kbytes = 12288
I tried kernel-kirkwood-3.7.3-101.fc17.armv5tel, and it seemed to still suffer from this problem. I have been running Lubomir's patch for two months without encountering this bug, so something he did should give a clue as to what the problem is, even if it's not the "right" fix.
It seems vm.min_free_kbytes = 12288 is already set by default by initscripts-9.37.2-1.fc17.armv5tel, so that does not address this issue.
Please retest with 3.8.x
Please reopen this bug. I still have this problem with the latest kernel, 3.8.3-102.fc17.armv5tel.kirkwood.
Please reopen this bug. The current kernel still has this problem.