Description of problem:
We have several physical servers running Red Hat 5.2 (kernel version 2.6.18-92.1.1.el5xen #1 SMP). They host Xen domains running Red Hat 4.6 (kernel version 2.6.9-67.0.15.ELxenU). A developer here was recursively scp'ing a directory structure from one domain to another (on different physical servers), and this caused a panic on the domain the files were being copied from.

Version-Release number of selected component (if applicable):
Red Hat 5.2 (kernel version 2.6.18-92.1.1.el5xen #1 SMP) and Red Hat 4.6 (kernel version 2.6.9-67.0.15.ELxenU). The domains run from files on the Dom0's filesystem.

How reproducible:
Recursively scp a directory structure from one domain to another. This has affected two different systems in the last two days.

Steps to Reproduce:
1. Recursively scp a large directory structure from one domain to another
2. Wait
3. Watch the domain the files are copied from panic

Actual results:
Panic on source domain:

------------[ cut here ]------------
kernel BUG at arch/i386/mm/hypervisor.c:336!
invalid operand: 0000 [#1]
SMP
Modules linked in: nls_utf8 cifs md5 ipv6 dm_mirror dm_mod xennet ext3 jbd xenblk sd_mod scsi_mod
CPU:    0
EIP:    0061:[<c0115983>]    Not tainted VLI
EFLAGS: 00010096   (2.6.9-67.0.15.ELxenU)
EIP is at xen_create_contiguous_region+0x362/0x420
eax: ffffffff   ebx: 00000006   ecx: d3295cc0   edx: 00000001
esi: 00000000   edi: c8365000   ebp: 00008365   esp: d3295ca4
ds: 007b   es: 007b   ss: 0068
Process scp (pid: 31802, threadinfo=d3295000 task=e3132030)
Stack: 3c210063 80000000 00000000 00000000 00000000 00000001 00008365 d3295cbc
       00000001 00000000 00000000 00007ff0 000004d0 c02a2980 00000001 00000001
       c3258040 c8364000 00000001 c01fae26 c8364000 00000001 00000000 00000000
Call Trace:
 [<c01fae26>] skbuff_ctor+0x2c/0x56
 [<c01431f9>] cache_init_objs+0x35/0x56
 [<c014337b>] cache_grow+0xfb/0x187
 [<c014356a>] cache_alloc_refill+0x163/0x19c
 [<c0143785>] kmem_cache_alloc+0x67/0x97
 [<c0212894>] alloc_skb_from_cache+0x3a/0xb2
 [<c01fad9a>] __alloc_skb+0x76/0x7a
 [<c0211da5>] sock_alloc_send_pskb+0x6c/0x1d8
 [<c0211f28>] sock_alloc_send_skb+0x17/0x1b
 [<c02667ed>] unix_stream_sendmsg+0x150/0x34f
 [<c020f4ab>] sock_aio_write+0xfe/0x10d
 [<c0159f03>] do_sync_write+0xaf/0xda
 [<c0119fec>] autoremove_wake_function+0x0/0x3a
 [<c0159fea>] vfs_write+0xbc/0xd8
 [<c015a0a4>] sys_write+0x3b/0x63
 [<c010734f>] syscall_call+0x7/0xb
Code: 00 00 00 8b 44 24 50 bb 06 00 00 00 8d 4c 24 1c 8b 54 24 14 05 00 00 00 40 c1 e8 0c 8d 2c 10 89 6c 24 18 e8 00 b8 fe ff 48 74 08 <0f> 0b 50 01 26 54 27 c0 8b 44 24 18 31 f6 89 fb 8b 0d cc c8 29

Expected results:
The domain should not panic.

Additional info:
We reverted the kernel to a previous version (2.6.9-67.0.7.ELxenU) on the domain we were copying the files from, and the recursive scp completed successfully.
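For anyone triaging a panic like this one, a minimal sketch of pulling the symbol names out of the call trace may help when comparing it against a changelog. The regular expression and the helper name parse_frames are illustrative assumptions, not part of any kernel tooling; the sample text is a few frames copied from the trace in this report.

```python
import re

# Matches frames such as "[<c01fae26>] skbuff_ctor+0x2c/0x56".
# The pattern is an assumption based on the oops format seen in this report.
FRAME_RE = re.compile(
    r"\[<([0-9a-f]{8})>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
)


def parse_frames(oops_text):
    """Return (address, symbol, offset, size) for each call-trace frame."""
    return [
        (addr, sym, int(off, 16), int(size, 16))
        for addr, sym, off, size in FRAME_RE.findall(oops_text)
    ]


# A few frames copied from the report; works on flattened or multi-line text.
SAMPLE = (
    "EIP: 0061:[<c0115983>] Not tainted VLI "
    "Call Trace: [<c01fae26>] skbuff_ctor+0x2c/0x56 "
    "[<c01431f9>] cache_init_objs+0x35/0x56 "
    "[<c02667ed>] unix_stream_sendmsg+0x150/0x34f"
)

frames = parse_frames(SAMPLE)
symbols = [sym for _, sym, _, _ in frames]
print(symbols)
```

The EIP line carries a bracketed address but no symbol+offset pair, so the pattern skips it and only real frames are returned.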
The patches that went in between 2.6.9-67.0.7 and 2.6.9-67.0.15 are:

* Tue Apr 22 2008 Vitaly Mayatskikh <vmayatsk> [2.6.9-67.0.15]
-fix kabi breakage in 67.0.14

* Tue Apr 22 2008 Vitaly Mayatskikh <vmayatsk> [2.6.9-67.0.14]
-fs: serialize file access for dnotify (Alexander Viro) [443437] {CVE-2008-1669}
-update: fix race condition in dnotify (Alexander Viro) [439756] {CVE-2008-1375}

* Wed Apr 16 2008 Vitaly Mayatskikh <vmayatsk> [2.6.9-67.0.13]
-Revert: Add HP DL580 G5 to bfsort whitelist (Tony Camuso) [437976]

* Mon Apr 14 2008 Vitaly Mayatskikh <vmayatsk> [2.6.9-67.0.12]
-fs: fix race condition in dnotify (Alexander Viro) [439756] {CVE-2008-1375}

* Wed Apr 9 2008 Vitaly Mayatskikh <vmayatsk> [2.6.9-67.0.11]
-nfs: High vm pagecache reclaim latency on systems with large highmem to lowmem ratio fix (Larry Woodman) [438345]
-nfs: Fix nfs read performance regression. Introduce a new tunable (Larry Woodman) [438477]
-Retry: check to see if agp is valid before reporting aperture size warnings (Brian Maly) [392771 431897]
-Ensure IV is in linear part of the skb to avoid BUG due to OOB access (Thomas Graf) [427245] {CVE-2007-6282}
-fix unprivileged crash on x86_64 cs corruption (Jarod Wilson) [439786] {CVE-2008-1615}

* Wed Mar 19 2008 Vitaly Mayatskikh <vmayatsk> [2.6.9-67.0.10]
-update: do not return zero in mmap (Vitaly Mayatskikh) [400811]
-neofb: avoid overwriting fb_info fields (Vitaly Mayatskikh) [430251]
-[NET] link_watch: always schedule urgent events (Don Dutile) [436102]
-nlm: fix a client side race on blocking locks (Jeff Layton) [436129]
-nlm: cleanup for blocked locks (Jeff Layton) [436129]
-Add HP DL580 G5 to bfsort whitelist (Tony Camuso) [437976]
-nfs: Discard pagecache data for dirs on denty_iput (Jeff Layton) [437788]

* Wed Mar 12 2008 Vitaly Mayatskikh <vmayatsk> [2.6.9-67.0.9]
-[NET] link_watch: handle jiffies wraparound (Vince Worthington) [436749]
-libata: un-blacklist hitachi drives to enable NCQ (David Milburn) [436499]
-libata: sata_nv may send commands with duplicate tags (David Milburn) [436499]

* Fri Mar 7 2008 Vitaly Mayatskikh <vmayatsk> [2.6.9-67.0.8]
-Insufficient range checks in fault handlers with mremap (Vitaly Mayatskikh) [428968] {CVE-2008-0007}
-[MOXA] buffer overflow in moxa driver (Vitaly Mayatskikh) [423131] {CVE-2005-0504}
-Fix unix stream socket recv race condition (Hideo AOKI) [435122]

Interestingly, none of them is Xen-specific, so one of the generic changes must have caused the breakage. We'll have to look further to see what is going on.

Chris Lalancette
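One rough way to narrow the list is to flag changelog entries that mention the subsystems seen in the call trace (skb allocation, unix stream sockets). The sketch below does a plain keyword match; the keyword list and the helper name suspects are assumptions chosen for illustration, and only a subset of the entries above is included. A keyword hit is a hint about where to look first, not evidence that the patch caused the panic.

```python
# Subset of the changelog above, keyed by release (text copied from the list).
CHANGELOG = {
    "67.0.14": [
        "fs: serialize file access for dnotify",
        "update: fix race condition in dnotify",
    ],
    "67.0.11": [
        "Ensure IV is in linear part of the skb to avoid BUG due to OOB access",
        "nfs: Fix nfs read performance regression. Introduce a new tunable",
    ],
    "67.0.10": ["[NET] link_watch: always schedule urgent events"],
    "67.0.8": [
        "Fix unix stream socket recv race condition",
        "[MOXA] buffer overflow in moxa driver",
    ],
}

# Hypothetical keywords derived by eye from the call-trace symbols.
KEYWORDS = ("skb", "unix", "sock")


def suspects(changelog, keywords):
    """Return (release, entry, matched_keywords) for entries with any hit."""
    hits = []
    for release, entries in changelog.items():
        for entry in entries:
            lowered = entry.lower()
            matched = [k for k in keywords if k in lowered]
            if matched:
                hits.append((release, entry, matched))
    return hits


hits = suspects(CHANGELOG, KEYWORDS)
for release, entry, matched in hits:
    print(release, matched, entry)
```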
I'm unable to recreate this. I've scp'ed entire file systems under load using the kernel versions listed above. I've tried both with and without NFS, even though NFS doesn't appear to be involved, judging by the "Modules linked in" line of the panic.

Since this is an old bug, I'm guessing it's no longer being seen, at least not on later kernels. I'm going to close it for now as insufficient data. If anybody can recreate this reliably, they can reopen it. I'll be happy to work with them to bisect using the patches listed in c#2, starting with those that touch the call trace or otherwise look suspicious. Also, on reopening we should get the hardware details and the full VM config.
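If the panic becomes reproducible, the search over releases could be organised as a binary search between the known-good 67.0.7 and the known-bad 67.0.15. The sketch below assumes a hypothetical callback, panics(release), that boots the guest on that kernel, runs the recursive scp, and reports whether it panicked; no such helper exists in this report. It also assumes the regression was introduced once and never fixed in between.

```python
# Releases between the known-good and known-bad kernels, oldest first.
RELEASES = [
    "67.0.7", "67.0.8", "67.0.9", "67.0.10", "67.0.11",
    "67.0.12", "67.0.13", "67.0.14", "67.0.15",
]


def first_bad(releases, panics):
    """Binary search for the first release where panics(release) is True.

    Assumes releases[0] is known good and releases[-1] is known bad.
    """
    lo, hi = 0, len(releases) - 1  # lo: last known good, hi: first known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if panics(releases[mid]):
            hi = mid
        else:
            lo = mid
    return releases[hi]


# Stand-in for the real test, purely illustrative and NOT a finding from
# this bug: pretend the regression first appears in 67.0.11.
pretend_first_bad = RELEASES.index("67.0.11")


def fake_panics(release):
    return RELEASES.index(release) >= pretend_first_bad


print(first_bad(RELEASES, fake_panics))
```

With nine releases this needs three test boots rather than seven. Once a release is identified, the same search can be repeated over the individual patches in that release.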