Bug 453928

Summary:	Recursive scp of files from one Xen domain to another (on another physical server) causes a panic
Product:	Red Hat Enterprise Linux 4	Reporter:	John Burnham <jpb15>
Component:	kernel-xen	Assignee:	Andrew Jones <drjones>
Status:	CLOSED INSUFFICIENT_DATA	QA Contact:	Martin Jenner <mjenner>
Severity:	high	Docs Contact:
Priority:	low
Version:	4.6	CC:	clalance, drjones, rick.beldin, tom, xen-maint
Target Milestone:	rc
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Last Closed:	2009-08-07 13:17:26 UTC	Type:	---
Bug Depends On:
Bug Blocks:	458302

Description John Burnham 2008-07-03 09:59:52 UTC

Description of problem:

We have several physical servers running Red Hat 5.2 (kernel version
2.6.18-92.1.1.el5xen #1 SMP). They host Xen domains running Red Hat
4.6 (kernel version 2.6.9-67.0.15.ELxenU). A developer here was
recursively scp'ing a directory structure from one domain to another (on
different physical servers), and this caused a panic on the domain the
files were being copied from.

Version-Release number of selected component (if applicable):
Red Hat 5.2 (kernel version 2.6.18-92.1.1.el5xen #1 SMP)
and
Red Hat 4.6 (kernel version 2.6.9-67.0.15.ELxenU)
The domains run from files on the Dom0 filesystem.

How reproducible:

Recursively scp a directory structure from one domain to another. This has
affected two different systems in the last two days.

Steps to Reproduce:
1. Recursively scp a large directory structure from one domain to another.
2. Wait.
3. Watch the domain the files are copied from panic.
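
For anyone retrying the steps above, the copy command can be written down explicitly. This is only a sketch: the host name, user and paths are placeholders (the report does not give them), and `build_scp_command` is a helper name made up for illustration. It only builds and prints the command; it does not run it.

```python
import shlex


def build_scp_command(source_dir, user, host, dest_dir):
    """Return the argv list for a recursive scp of source_dir to host:dest_dir."""
    # "-r" is scp's recursive-copy flag; everything else is a placeholder.
    return ["scp", "-r", source_dir, f"{user}@{host}:{dest_dir}"]


if __name__ == "__main__":
    # Hypothetical values: run from the source domU, targeting the other domU.
    cmd = build_scp_command("/srv/data", "dev", "domu-b.example.com", "/srv/incoming")
    print(shlex.join(cmd))
```

The printed command would be run on the domain the files are copied from, since that is the side that panicked.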
  
Actual results:

Panic on source domain:
------------[ cut here ]------------
kernel BUG at arch/i386/mm/hypervisor.c:336!
invalid operand: 0000 [#1]
SMP
Modules linked in: nls_utf8 cifs md5 ipv6 dm_mirror dm_mod xennet ext3 jbd
xenblk sd_mod scsi_mod
CPU:    0
EIP:    0061:[<c0115983>]    Not tainted VLI
EFLAGS: 00010096   (2.6.9-67.0.15.ELxenU)
EIP is at xen_create_contiguous_region+0x362/0x420
eax: ffffffff   ebx: 00000006   ecx: d3295cc0   edx: 00000001
esi: 00000000   edi: c8365000   ebp: 00008365   esp: d3295ca4
ds: 007b   es: 007b   ss: 0068
Process scp (pid: 31802, threadinfo=d3295000 task=e3132030)
Stack: 3c210063 80000000 00000000 00000000 00000000 00000001 00008365 d3295cbc
       00000001 00000000 00000000 00007ff0 000004d0 c02a2980 00000001 00000001
       c3258040 c8364000 00000001 c01fae26 c8364000 00000001 00000000 00000000
Call Trace:
 [<c01fae26>] skbuff_ctor+0x2c/0x56
 [<c01431f9>] cache_init_objs+0x35/0x56
 [<c014337b>] cache_grow+0xfb/0x187
 [<c014356a>] cache_alloc_refill+0x163/0x19c
 [<c0143785>] kmem_cache_alloc+0x67/0x97
 [<c0212894>] alloc_skb_from_cache+0x3a/0xb2
 [<c01fad9a>] __alloc_skb+0x76/0x7a
 [<c0211da5>] sock_alloc_send_pskb+0x6c/0x1d8
 [<c0211f28>] sock_alloc_send_skb+0x17/0x1b
 [<c02667ed>] unix_stream_sendmsg+0x150/0x34f
 [<c020f4ab>] sock_aio_write+0xfe/0x10d
 [<c0159f03>] do_sync_write+0xaf/0xda
 [<c0119fec>] autoremove_wake_function+0x0/0x3a
 [<c0159fea>] vfs_write+0xbc/0xd8
 [<c015a0a4>] sys_write+0x3b/0x63
 [<c010734f>] syscall_call+0x7/0xb
Code: 00 00 00 8b 44 24 50 bb 06 00 00 00 8d 4c 24 1c 8b 54 24 14 05 00 00 00 40
c1 e8 0c 8d 2c 10 89 6c 24 18 e8 00 b8 fe ff 48 74 08 <0f> 0b 50 01 26 54 27 c0
8b 44 24 18 31 f6 89 fb 8b 0d cc c8 29
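
The `Code:` bytes can be cross-checked against the reported BUG location. The sketch below assumes that this i386 kernel's BUG() macro emits `ud2` (0f 0b) followed by a 16-bit line number and a 32-bit pointer to the file-name string; that layout is an assumption, not something stated in the report. `decode_bug_site` is a name made up for illustration.

```python
CODE_LINE = (
    "00 00 00 8b 44 24 50 bb 06 00 00 00 8d 4c 24 1c 8b 54 24 14 05 00 00 00 40 "
    "c1 e8 0c 8d 2c 10 89 6c 24 18 e8 00 b8 fe ff 48 74 08 <0f> 0b 50 01 26 54 27 c0 "
    "8b 44 24 18 31 f6 89 fb 8b 0d cc c8 29"
)


def decode_bug_site(code_line):
    """Return (line_number, file_name_pointer) encoded after the faulting ud2."""
    tokens = code_line.split()
    # The oops marks the byte at the faulting EIP with angle brackets.
    fault = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    raw = bytes(int(tok.strip("<>"), 16) for tok in tokens[fault:fault + 8])
    if raw[:2] != b"\x0f\x0b":
        raise ValueError("faulting instruction is not ud2")
    # Assumed layout: ud2, then .word __LINE__, then .long __FILE__ (little-endian).
    line = int.from_bytes(raw[2:4], "little")
    file_ptr = int.from_bytes(raw[4:8], "little")
    return line, file_ptr


if __name__ == "__main__":
    line, file_ptr = decode_bug_site(CODE_LINE)
    print(f"BUG line {line}, file string at {file_ptr:#010x}")
```

On the bytes above this gives line 336, which agrees with `hypervisor.c:336` in the panic header. The three bytes before the marker (48 74 08) decode as `dec %eax; je +8`, so together with `eax: ffffffff` in the register dump the preceding call appears to have returned 0 where 1 was expected.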


Expected results:

Well, for the domain not to panic

Additional info:
We reverted the kernel to a previous version (2.6.9-67.0.7.ELxenU) on the domain
we were copying the files from and the recursive scp completed successfully.

Comment 2 Chris Lalancette 2009-01-22 10:13:26 UTC

The patches that went in between 2.6.9-67.0.7 and 2.6.9-67.0.15 are:

* Tue Apr 22 2008 Vitaly Mayatskikh <vmayatsk> [2.6.9-67.0.15]
-fix kabi breakage in 67.0.14

* Tue Apr 22 2008 Vitaly Mayatskikh <vmayatsk> [2.6.9-67.0.14]
-fs: serialize file access for dnotify (Alexander Viro) [443437] {CVE-2008-1669}
-update: fix race condition in dnotify (Alexander Viro) [439756] {CVE-2008-1375}

* Wed Apr 16 2008 Vitaly Mayatskikh <vmayatsk> [2.6.9-67.0.13]
-Revert: Add HP DL580 G5 to bfsort whitelist (Tony Camuso) [437976]

* Mon Apr 14 2008 Vitaly Mayatskikh <vmayatsk> [2.6.9-67.0.12]
-fs: fix race condition in dnotify (Alexander Viro) [439756] {CVE-2008-1375}

* Wed Apr  9 2008 Vitaly Mayatskikh <vmayatsk> [2.6.9-67.0.11]
-nfs: High vm pagecache reclaim latency on systems with large highmem to lowmem ratio fix (Larry Woodman) [438345]
-nfs: Fix nfs read performance regression. Introduce a new tunable (Larry Woodman) [438477]
-Retry: check to see if agp is valid before reporting aperture size warnings (Brian Maly) [392771 431897]
-Ensure IV is in linear part of the skb to avoid BUG due to OOB access (Thomas Graf) [427245] {CVE-2007-6282}
-fix unprivileged crash on x86_64  cs corruption (Jarod Wilson) [439786] {CVE-2008-1615}

* Wed Mar 19 2008 Vitaly Mayatskikh <vmayatsk> [2.6.9-67.0.10]
-update: do not return zero in mmap (Vitaly Mayatskikh) [400811]
-neofb: avoid overwriting fb_info fields (Vitaly Mayatskikh) [430251]
-[NET] link_watch: always schedule urgent events (Don Dutile) [436102]
-nlm: fix a client side race on blocking locks (Jeff Layton) [436129]
-nlm: cleanup for blocked locks (Jeff Layton) [436129]
-Add HP DL580 G5 to bfsort whitelist (Tony Camuso) [437976]
-nfs: Discard pagecache data for dirs on denty_iput (Jeff Layton) [437788]

* Wed Mar 12 2008 Vitaly Mayatskikh <vmayatsk> [2.6.9-67.0.9]
-[NET] link_watch: handle jiffies wraparound (Vince Worthington) [436749]
-libata: un-blacklist hitachi drives to enable NCQ (David Milburn) [436499]
-libata: sata_nv may send commands with duplicate tags (David Milburn) [436499]

* Fri Mar  7 2008 Vitaly Mayatskikh <vmayatsk> [2.6.9-67.0.8]
-Insufficient range checks in fault handlers with mremap (Vitaly Mayatskikh) [428968] {CVE-2008-0007}
-[MOXA] buffer overflow in moxa  driver (Vitaly Mayatskikh) [423131] {CVE-2005-0504}
-Fix unix stream socket recv race condition (Hideo AOKI) [435122]

Interestingly, none of them is Xen-specific, so one of these generic changes must have caused the breakage.  We'll have to look further to see what is going on.

Chris Lalancette
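
One way to shortlist entries from the changelog in comment 2 is to match them against subsystems that appear in the call trace. The sketch below is only a filter, not a diagnosis. The keyword list (`skb`, `unix`, `sock`) is hand-picked from function names in the trace and is an assumption, and `suspects` is a name made up for illustration.

```python
import re

CHANGELOG = """\
* Tue Apr 22 2008 Vitaly Mayatskikh <vmayatsk> [2.6.9-67.0.15]
-fix kabi breakage in 67.0.14
* Tue Apr 22 2008 Vitaly Mayatskikh <vmayatsk> [2.6.9-67.0.14]
-fs: serialize file access for dnotify (Alexander Viro) [443437] {CVE-2008-1669}
-update: fix race condition in dnotify (Alexander Viro) [439756] {CVE-2008-1375}
* Wed Apr 16 2008 Vitaly Mayatskikh <vmayatsk> [2.6.9-67.0.13]
-Revert: Add HP DL580 G5 to bfsort whitelist (Tony Camuso) [437976]
* Mon Apr 14 2008 Vitaly Mayatskikh <vmayatsk> [2.6.9-67.0.12]
-fs: fix race condition in dnotify (Alexander Viro) [439756] {CVE-2008-1375}
* Wed Apr  9 2008 Vitaly Mayatskikh <vmayatsk> [2.6.9-67.0.11]
-nfs: High vm pagecache reclaim latency on systems with large highmem to lowmem ratio fix (Larry Woodman) [438345]
-nfs: Fix nfs read performance regression. Introduce a new tunable (Larry Woodman) [438477]
-Retry: check to see if agp is valid before reporting aperture size warnings (Brian Maly) [392771 431897]
-Ensure IV is in linear part of the skb to avoid BUG due to OOB access (Thomas Graf) [427245] {CVE-2007-6282}
-fix unprivileged crash on x86_64  cs corruption (Jarod Wilson) [439786] {CVE-2008-1615}
* Wed Mar 19 2008 Vitaly Mayatskikh <vmayatsk> [2.6.9-67.0.10]
-update: do not return zero in mmap (Vitaly Mayatskikh) [400811]
-neofb: avoid overwriting fb_info fields (Vitaly Mayatskikh) [430251]
-[NET] link_watch: always schedule urgent events (Don Dutile) [436102]
-nlm: fix a client side race on blocking locks (Jeff Layton) [436129]
-nlm: cleanup for blocked locks (Jeff Layton) [436129]
-Add HP DL580 G5 to bfsort whitelist (Tony Camuso) [437976]
-nfs: Discard pagecache data for dirs on denty_iput (Jeff Layton) [437788]
* Wed Mar 12 2008 Vitaly Mayatskikh <vmayatsk> [2.6.9-67.0.9]
-[NET] link_watch: handle jiffies wraparound (Vince Worthington) [436749]
-libata: un-blacklist hitachi drives to enable NCQ (David Milburn) [436499]
-libata: sata_nv may send commands with duplicate tags (David Milburn) [436499]
* Fri Mar  7 2008 Vitaly Mayatskikh <vmayatsk> [2.6.9-67.0.8]
-Insufficient range checks in fault handlers with mremap (Vitaly Mayatskikh) [428968] {CVE-2008-0007}
-[MOXA] buffer overflow in moxa  driver (Vitaly Mayatskikh) [423131] {CVE-2005-0504}
-Fix unix stream socket recv race condition (Hideo AOKI) [435122]
"""

# Assumption: prefixes taken by hand from the trace (alloc_skb, unix_stream_sendmsg, sock_*).
KEYWORDS = ("skb", "unix", "sock")


def suspects(changelog, keywords):
    """Return (version, entry) pairs whose text contains a word starting with a keyword."""
    pattern = re.compile(r"\b(?:%s)" % "|".join(map(re.escape, keywords)), re.IGNORECASE)
    version, hits = None, []
    for line in changelog.splitlines():
        header = re.match(r"\* .*\[(\S+)\]$", line)
        if header:
            version = header.group(1)          # e.g. "2.6.9-67.0.11"
        elif line.startswith("-") and pattern.search(line):
            hits.append((version, line[1:]))   # drop the leading "-"
    return hits


if __name__ == "__main__":
    for version, entry in suspects(CHANGELOG, KEYWORDS):
        print(version, "|", entry)
```

With these keywords the filter keeps two entries: the skb change in 67.0.11 [427245] and the unix stream socket change in 67.0.8 [435122]. A textual match only suggests where to look first; it says nothing about which patch, if any, is responsible.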

Comment 3 Andrew Jones 2009-08-07 13:17:26 UTC

I'm unable to recreate this.  I've scp'ed entire file systems under load using the kernel versions listed above.  I've tried both with and without NFS, even though it doesn't look like NFS was involved based on the "Modules linked in" line of the panic.

Since this is an old bug I'm guessing it's not being seen any more, at least not on later kernels.  I'm going to close for now as insufficient data.  If anybody can recreate this reliably then they can reopen it.  I'll be happy to work with them to create bisections using some of the patches listed in c#2, i.e. what touches the call trace, or otherwise looks suspicious.  Also, on reopening we should get the hardware details and full VM config.
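
If a reliable reproducer turns up, the search over the builds listed in c#2 could be organised as a plain binary search. This is a sketch only. It assumes that every intermediate build between 67.0.7 (known good) and 67.0.15 (known bad) can be installed on the guest, and that once a build is bad all later builds are bad too. `first_bad` and the `is_bad` callback are names made up for illustration; in practice `is_bad` would mean "boot this kernel, run the recursive scp, see whether the domain panics".

```python
# Builds from the changelog in c#2: 67.0.7 (good) through 67.0.15 (bad).
BUILDS = [f"2.6.9-67.0.{n}" for n in range(7, 16)]


def first_bad(builds, is_bad):
    """Return the earliest bad build, given builds[0] is good and builds[-1] is bad."""
    lo, hi = 0, len(builds) - 1      # invariant: builds[lo] good, builds[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(builds[mid]):
            hi = mid
        else:
            lo = mid
    return builds[hi]


if __name__ == "__main__":
    tested = []

    def pretend_is_bad(build):
        # Stand-in for a real test run; pretends the breakage arrived in 67.0.11.
        tested.append(build)
        return int(build.rsplit(".", 1)[1]) >= 11

    print(first_bad(BUILDS, pretend_is_bad), "after testing", tested)
```

With nine builds this needs at most three test boots. Once the first bad build is known, the same loop can be repeated over that build's individual patches.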