Bug 845143

Summary: Soft lockup on Ionics Stratus ARM computer
Product: [Fedora] Fedora Reporter: Steven A. Falco <safalco>
Component: kernelAssignee: Jon Masters <jcm>
Status: CLOSED CURRENTRELEASE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high Docs Contact:
Priority: unspecified    
Version: 17CC: andrew, gansalmon, itamar, jonathan, kernel-maint, lkundrak, madhu.chinakonda, matthew.hirsch, pbrobinson
Target Milestone: ---   
Target Release: ---   
Hardware: arm   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-04-13 08:27:47 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 245418    
Attachments:
Description Flags
Complete console log showing the lockup
none
soft lockup bug console log none

Description Steven A. Falco 2012-08-02 00:15:34 UTC
Description of problem: I am attempting to build activeMQ via maven / java, on an ARM-based IONICS Stratus plug computer.

The computer hangs part way into the build, and I get a kernel error message about a soft lockup.  It looks like this is originating somewhere in the networking stack:

[ 1069.060253] BUG: soft lockup - CPU#0 stuck for 22s! [java:686]
[ 1069.066116] Modules linked in: nfs nfs_acl auth_rpcgss fscache lockd sunrpc vfat fat mtdchar ofpart libertas_sdio libertas orion_nand cfg80211 rfkill nand nand_ecc mv643xx_eth nand_ids mtd mv_cesa usb_storage mmc_block mvsdio mmc_core [last unloaded: scsi_wait_scan]
[ 1069.089990]
[ 1069.091482] Pid: 686, comm:                 java
[ 1069.096116] CPU: 0    Not tainted  (3.4.6-2.fc17.armv5tel.kirkwood #1)       
[ 1069.102684] PC is at dma_cache_maint_page+0xb0/0xe0
[ 1069.107583] LR is at dma_cache_maint_page+0xa8/0xe0
[ 1069.112485] pc : [<c000f724>]    lr : [<c000f71c>]    psr: 80000013          
[ 1069.112490] sp : c7475c58  ip : c022eed0  fp : 00000fff00
[ 1069.124026] r10: 000002ec  r9 : c066cb88  r8 : 0a23300000
[ 1069.129276] r7 : 00000001  r6 : 00000076  r5 : c0b109c0  r4 : 00000abc       
[ 1069.135834] r3 : 00000000  r2 : c066cb88  r1 : 00000076  r0 : dde4eabc       
[ 1069.142392] Flags: Nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[ 1069.149559] Control: 0005397f  Table: 1dd6c000  DAC: 00000015
[ 1069.155349] [<c000f1f8>] (unwind_backtrace+0x0/0x124) from [<c0077e10>] (watchdog_timer_fn+0xf0/0x144)
[ 1069.164706] [<c0077e10>] (watchdog_timer_fn+0xf0/0x144) from [<c0038ab4>] (__run_hrtimer+0xb0/0x1d4)
[ 1069.173886] [<c0038ab4>] (__run_hrtimer+0xb0/0x1d4) from [<c00392d0>] (hrtimer_interrupt+0x10c/0x254)
[ 1069.183161] [<c00392d0>] (hrtimer_interrupt+0x10c/0x254) from [<c001604c>] (orion_timer_interrupt+0x24/0x34)
[ 1069.193046] [<c001604c>] (orion_timer_interrupt+0x24/0x34) from [<c007859c>] (handle_irq_event_percpu+0x38/0x240)
[ 1069.203360] [<c007859c>] (handle_irq_event_percpu+0x38/0x240) from [<c00787d4>] (handle_irq_event+0x30/0x40)
[ 1069.213240] [<c00787d4>] (handle_irq_event+0x30/0x40) from [<c007ae1c>] (handle_level_irq+0xbc/0xd0)
[ 1069.222421] [<c007ae1c>] (handle_level_irq+0xbc/0xd0) from [<c0077fc8>] (generic_handle_irq+0x28/0x38)
[ 1069.231778] [<c0077fc8>] (generic_handle_irq+0x28/0x38) from [<c0009b84>] (handle_IRQ+0x68/0x8c)
[ 1069.240611] [<c0009b84>] (handle_IRQ+0x68/0x8c) from [<c042c574>] (__irq_svc+0x34/0x80)
[ 1069.248662] [<c042c574>] (__irq_svc+0x34/0x80) from [<c000f724>] (dma_cache_maint_page+0xb0/0xe0)
[ 1069.257583] [<c000f724>] (dma_cache_maint_page+0xb0/0xe0) from [<c000fbdc>] (___dma_page_dev_to_cpu+0x7c/0xcc)
[ 1069.267642] [<c000fbdc>] (___dma_page_dev_to_cpu+0x7c/0xcc) from [<c022cda8>] (dma_async_memcpy_buf_to_pg+0x134/0x1a8)
[ 1069.278398] [<c022cda8>] (dma_async_memcpy_buf_to_pg+0x134/0x1a8) from [<c022e1f8>] (dma_memcpy_to_iovec+0xd4/0x158)
[ 1069.288981] [<c022e1f8>] (dma_memcpy_to_iovec+0xd4/0x158) from [<c036c7e0>] (dma_skb_copy_datagram_iovec+0x5c/0x1d4)
[ 1069.299564] [<c036c7e0>] (dma_skb_copy_datagram_iovec+0x5c/0x1d4) from [<c0392e9c>] (tcp_recvmsg+0x5a8/0x9f4)
[ 1069.309541] [<c0392e9c>] (tcp_recvmsg+0x5a8/0x9f4) from [<c03b0ae0>] (inet_recvmsg+0x48/0x5c)
[ 1069.318122] [<c03b0ae0>] (inet_recvmsg+0x48/0x5c) from [<c0345d60>] (sock_recvmsg+0xc0/0xe8)
[ 1069.326611] [<c0345d60>] (sock_recvmsg+0xc0/0xe8) from [<c03477b0>] (sys_recvfrom+0xa0/0x10c)
[ 1069.335182] [<c03477b0>] (sys_recvfrom+0xa0/0x10c) from [<c0347838>] (sys_recv+0x1c/0x20)
[ 1069.343404] [<c0347838>] (sys_recv+0x1c/0x20) from [<c0008c40>] (ret_fast_syscall+0x0/0x2c)


Version-Release number of selected component (if applicable):

Linux version 3.4.6-2.fc17.armv5tel.kirkwood (mockbuild.hsv.redhat.com)


How reproducible: 100%

Steps to Reproduce:
1. Download the source for activemq-parent-5.5.1
2. attempt to build via: mvn clean install
  
Actual results: soft lock up

Expected results: no lock up


Additional info: I have a 500 GB USB hard-drive connected, providing the root filesystem.  The kernel / initrd are loaded over tftp.  I'll add an attachment showing the entire console output from bootup to the hang.

Comment 1 Steven A. Falco 2012-08-02 00:16:26 UTC
Created attachment 601844 [details]
Complete console log showing the lockup

Comment 2 Matt Hirsch 2012-09-26 02:40:19 UTC
I'll add a "me too" - I have been having this problem (a very similar BUG message) on a sheevaplug. It could be related to the network stack. I run into it when 1) running a python script that streams data over the network 2) accessing saned via xinetd. 3) Other times seemingly at random.

I have encountered this using kernel-kirkwood-3.4.2-3.fc17.armv5tel.rpm, and 3.6.0-0.rc6.git2.1.fc18.armv5tel.kirkwood. (For me, the 3.5.X kernels don't boot due to a different RTC bug). I'll attach an example of the BUG message below.

There is at least one other person running into this bug on a dreamplug: http://forums.fedoraforum.org/showthread.php?t=284430

I have noticed that the RTC is set to a strange value after the soft lockup.

Comment 3 Matt Hirsch 2012-09-26 02:42:43 UTC
Created attachment 617361 [details]
soft lockup bug console log

Comment 4 Lubomir Rintel 2012-12-27 19:46:19 UTC
Thank you for your bug reports!
Could you please check whether the following patch fixes the issue for you?

https://github.com/lkundrak/linux/commit/e88886956244c67d924d612c9a8af7d01f1adc26

Comment 5 Steven A. Falco 2012-12-28 01:25:57 UTC
Unfortunately, I no longer own the IONICS Stratus, so I cannot try out the proposed fix.  However, the patch you mentioned sure seems like a good candidate.

Matt - you had a similar problem.  Are you in a position to try the fix?

Comment 6 Matt Hirsch 2012-12-28 01:46:36 UTC
I will try this with my sheevaplug - it will be a couple of weeks before I can get to it though. I'm in the middle of a huge deadline. Sorry!

Actually, if you have a compiled kernel rpm package I can try it sooner, but I don't have an ARM build setup going right now.

Comment 7 Lubomir Rintel 2012-12-28 16:44:46 UTC
Matt, this is what I use on my Guruplug currently:
http://fedorapeople.org/~lkundrak/kernel-kirkwood/

Hope it helps!

Comment 8 Matt Hirsch 2012-12-28 22:23:21 UTC
Lubomir, does that kernel have the patch you suggested applied? I have installed it on my sheevaplug, and all seems well so far. I'll report back if any soft lockups happen.

Thanks!

Matt

Comment 9 Matt Hirsch 2012-12-29 23:43:34 UTC
Just a followup - looks good so far. Haven't had any lockups under the circumstances that previously caused them. However, I do get messages like these in the messages file:

[92579.500850] mv_xor mv_xor.0: mv_xor_clean_completed_slots 362
[92579.500914] mv_xor mv_xor.0: mv_xor_prep_dma_memcpy dest: 18ad9960 src 1d0cb084 len: 258 flags: 18
[92579.500936] mv_xor mv_xor.0: mv_xor_prep_dma_memcpy sw_desc de692000 async_tx de69203c
[92579.500951] mv_xor mv_xor.0: mv_xor_tx_submit sw_desc de692000: async_tx de69203c
[92579.500968] mv_xor mv_xor.0: mv_xor_start_new_chain 270: sw_desc de692000
[92579.500980] mv_xor mv_xor.0:  activate chan.
[92579.501000] mv_xor mv_xor.0: intr cause 3
[92579.501014] mv_xor mv_xor.0: mv_xor_device_clear_eoc_cause, val 0xfffffffe
[92579.501036] mv_xor mv_xor.0: __mv_xor_slot_cleanup 402
[92579.501050] mv_xor mv_xor.0: current_desc 7844d40
[92579.501062] mv_xor mv_xor.0: mv_xor_clean_completed_slots 362
[92579.501097] mv_xor mv_xor.0: mv_xor_clean_slot 379: desc de692000 flags 18
[92579.501113] mv_xor mv_xor.0: mv_xor_free_slots 255 slot de692000
[92579.501164] mv_xor mv_xor.0: mv_xor_clean_completed_slots 362

Comment 10 Lubomir Rintel 2012-12-31 19:00:48 UTC
Thank you for your testing, Matt.

What you see is expected -- it's based off Rawhide kernel that has tons of debugging and sanity checks enabled (this is from CONFIG_DMA_DEBUG or something like that), equivalent to kernel-debug from release builds. It is probably a bit slower than release kernel as well, thus if you intend to use this, you may want to reapply the patch to a release kernel instead.

Have a nice 2013!

Comment 11 Andrew Lunn 2013-01-06 13:47:58 UTC
As mentioned on the linux-arm-kernel mailing list, removing these checks is wrong. The hardware imposes a minimum buffer size of 16 bytes. In practice, any DMA operation less than 128 bytes are rejected, because software will be faster than setting up the DMA, handling the interrupt when its finished, scheduling and running a tasklet, and running the completion callback etc.

The crypto/async_tx/async_memcpy.c:async_memcpy() functions has no issues with this minimum size limit, it falls back to memcpy().

Comment 12 Peter Robinson 2013-01-09 10:16:43 UTC
Have you tried the following. On low memory systems it basically keeps a little free for things like network buffers. We've seen it helps for some other ARM platforms.

cat << EOF > /etc/sysctl.conf
vm.min_free_kbytes = 12288
EOF

Comment 13 Matt Hirsch 2013-02-15 14:32:01 UTC
Two notes: 

I tried kernel-kirkwood-3.7.3-101.fc17.armv5tel, and it seemed to still suffer from this problem. I have been running Lubomir's patch for two months without encountering this bug, so something he did should give a clue as to what the problem is, even if it's not the "right" fix.

It seems vm.min_free_kbytes = 12288 is already set by default by initscripts-9.37.2-1.fc17.armv5tel, so that does not address this issue.

Comment 14 Peter Robinson 2013-03-31 18:52:58 UTC
Please retest with 3.8.x

Comment 15 Matt Hirsch 2013-04-13 15:41:51 UTC
Please reopen this bug. I still have this problem with the latest kernel, 3.8.3-102.fc17.armv5tel.kirkwood.

Comment 16 Matt Hirsch 2013-04-27 03:33:21 UTC
Please reopen this bug. The current kernel still has this problem.