Bug 444242

Summary: x86_64 pv_ops xen: Kernel panic while swapping (xen_failsafe_callback)
Product: [Fedora] Fedora Reporter: Jan Oravec <jan.oravec>
Component: kernel-xen-2.6 Assignee: Eduardo Habkost <ehabkost>
Status: CLOSED CURRENTRELEASE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: low Docs Contact:
Priority: low    
Version: 9 CC: berrange, xen-maint
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: xen-pvops-2.6.26-rc5-ehabkost2 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2008-06-16 07:01:37 EDT Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Bug Depends On:    
Bug Blocks: 434756    
Attachments:
Another kernel panic (flags: none)
Another kernel panic #2 (flags: none)
config (flags: none)

Description Jan Oravec 2008-04-25 18:58:58 EDT
I have extracted the Xen patches from kernel-xen-2.6-2.6.25-1.fc9.src.rpm and
applied them to 2.6.25. I am running this kernel as a XenU guest on Xen 3.2.0.
Sometimes, when the VM is swapping, I get this kernel panic:

DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff8055c000, task ffffffff8051d320)
Stack:  0000000000000001 ffffffff8055dcc8 ffffffff805278c0 0000000000000000
 ffffffff8055dcc8 ffffffff8055dcc8 ffff880001388080 0000000000000000
 000000000000000b ffffffff8055ddd8 ffffffff8051d320 ffff88000fc48f00
Call Trace:
 [<ffffffff80211ffd>] ? oops_end+0x7d/0x80
 [<ffffffff8021340c>] ? do_invalid_op+0x8c/0xa0
 [<ffffffff804878d0>] ? xen_failsafe_callback+0x0/0x10
 [<ffffffff8023bd46>] ? alloc_pid+0x296/0x3d0
 [<ffffffff8027e71d>] ? kmem_cache_alloc+0x6d/0xd0
 [<ffffffff804874ab>] ? error_exit+0x0/0x61
 [<ffffffff804878d0>] ? xen_failsafe_callback+0x0/0x10
 [<ffffffff804878d0>] ? xen_failsafe_callback+0x0/0x10
 [<ffffffff8020e668>] ? xen_clocksource_read+0x38/0x60
 [<ffffffff8020c4b2>] ? xen_mc_flush+0x52/0x180
 [<ffffffff8020c0d6>] ? xen_load_tls+0x76/0xa0
 [<ffffffff8020f908>] ? __switch_to+0x88/0x370
 [<ffffffff8020be4b>] ? xen_write_cr3+0x15b/0x170
 [<ffffffff80485d92>] ? thread_return+0x0/0x15e
 [<ffffffff8020c3c0>] ? xen_idle+0x0/0x50
 [<ffffffff802101a0>] ? cpu_idle+0x60/0x70


Code: 83 bb 08 07 00 00 00 74 05 e8 5f 2d 16 00 48 8b bb 60 07 00 00 48 85 ff 74
05 e8 fe 9c 05 00 48 c7 03 40 00 00 00 e8 b2 74 25 00 <0f> 0b eb fe 8b 8b 20 01
00 00 83 f9 ff 0f 85 27 ff ff ff 44 8b 
RIP  [<ffffffff8022e73e>] do_exit+0x51e/0x750
 RSP <ffffffff8055dca8>
---[ end trace b060fa16d852ccc5 ]---
Kernel panic - not syncing: Attempted to kill the idle task!
Comment 1 Jan Oravec 2008-04-26 17:04:17 EDT
Created attachment 303872
Another kernel panic
Comment 2 Jan Oravec 2008-04-26 17:04:42 EDT
Created attachment 303873
Another kernel panic #2
Comment 3 Jan Oravec 2008-04-26 17:09:59 EDT
Created attachment 303874
config

Attaching 2 similar kernel panics seen under different test loads, plus the
kernel config file.

The original kernel panic happened while compiling gcc on swap, the next one on
a stressed mysql server, and the third on postfix with some milter spam checks.

The GCC compilation stressed swap heavily, but mysql and postfix barely swapped.
Comment 4 Mark McLoughlin 2008-04-28 12:09:59 EDT
Jan: thanks for giving this a shot; if you want to keep up with the latest
x86_64 work, try building the master branch from:

  http://git.et.redhat.com/?p=xen-pvops-64.git

The issue here is that a kernel bug can cause xen_failsafe_callback to be
called, but we don't currently implement it. That is, the underlying cause is a
kernel bug, but we should implement xen_failsafe_callback anyway to handle such
bugs more gracefully. See bug #442949 for another example.
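
For triaging similar reports, a small script can check whether a pasted oops mentions xen_failsafe_callback in its trace. The following is only a sketch in Python: the function names and the regular expression are assumptions made for illustration, based on the frame format seen in the traces in this bug, and are not part of any kernel or Xen tooling.

```python
import re

# Matches entries such as " [<ffffffff804878d0>] ? xen_failsafe_callback+0x0/0x10".
# The "? " marker is optional, so "RIP  [<...>] do_exit+0x51e/0x750" matches too.
FRAME_RE = re.compile(
    r"\[<([0-9a-f]{16})>\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def trace_symbols(oops_text):
    """Return the symbol name of every '[<addr>] symbol+off/len' entry, in order."""
    return [m.group(3) for m in FRAME_RE.finditer(oops_text)]


def mentions_failsafe(oops_text, symbol="xen_failsafe_callback"):
    """True if the given symbol appears among the extracted trace entries."""
    return symbol in trace_symbols(oops_text)
```

Passing the full text of an oops to mentions_failsafe() gives a quick yes/no; trace_symbols() returns the whole list if the surrounding frames are of interest.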

Eduardo: re-assigning to you; note that this isn't a kernel-xen RPM build, but
the x86_64 patches applied on 2.6.25.
Comment 5 Mark McLoughlin 2008-04-28 12:12:50 EDT
Just for future reference: we're probably talking about patches from the
xen-pvops-64-2.6.25-rc9-markmc1 tag here.
Comment 6 Mark McLoughlin 2008-05-11 18:06:16 EDT
Jan: try again with the xen-pvops-64-2.6.25-rc9-ehabkost2 tag
Comment 7 Jan Oravec 2008-05-11 18:30:57 EDT
Thanks, Mark. I am currently in Las Vegas; I will give it a try as soon as I
return to Slovakia (19th May).
Comment 8 Bug Zapper 2008-05-14 06:11:39 EDT
Changing version to '9' as part of upcoming Fedora 9 GA.
More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Comment 9 Jan Oravec 2008-05-20 07:04:16 EDT
Mark, I am still getting a crash. I hope I extracted the patches correctly (I am
new to git). I extracted them with the command below and applied them to
2.6.25.4:

git-format-patch 120dd64cacd4fb796bca0acba3665553f1d9ecaa..3e1f4183fe564f53cf3642cc2ad9582c683a33ed

(the first is the Linux-2.6.25-rc9 commit, the second is the 'Clear %fs on
xen_load_tls()' commit)

The workload was asterisk running a 4-way phone conference over the network.

RBP: ffff88000ee7de14 R08: ffff88000ee7c000 R09: 0000000000000001
R10: 0000000000000001 R11: ffffffff8020eab0 R12: ffff88000ff4f180
R13: ffff88000ee7de2c R14: ffff88000ee7de08 R15: 0000000000000000
FS:  00007f230c0716f0(0063) GS:ffffffff8054f000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000000ed24000 CR4: 0000000000000660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process asterisk (pid: 2799, threadinfo ffff88000ee7c000, task ffff88000fc7c780)
Stack:  ffffffff8028f93d ffff88000ee6e600 ffff88000ee7dbb8 ffff88000ee6e870
 ffff88000ee7df68 0000000040ce6770 0000000000000000 ffffffff802906d0
 0000000000000000 0000000300000000 ffff88000ffebc00 ffffffff00000000
Call Trace:
 [<ffffffff8028f93d>] ? do_sys_poll+0x14d/0x380
 [<ffffffff802906d0>] ? __pollwait+0x0/0x130
 [<ffffffff802266a0>] ? default_wake_function+0x0/0x10
 [<ffffffff802266a0>] ? default_wake_function+0x0/0x10
 [<ffffffff802266a0>] ? default_wake_function+0x0/0x10
 [<ffffffff803de97b>] ? sock_sendmsg+0xcb/0x100
 [<ffffffff8039c31a>] ? rb_insert_color+0x8a/0xf0
 [<ffffffff802892a8>] ? pipe_read+0x2d8/0x420
 [<ffffffff8023e3a0>] ? autoremove_wake_function+0x0/0x30
 [<ffffffff80288fd9>] ? pipe_read+0x9/0x420
 [<ffffffff80282289>] ? do_sync_read+0xd9/0x120
 [<ffffffff803de261>] ? sockfd_lookup_light+0x41/0x80
 [<ffffffff803df554>] ? sys_sendto+0x174/0x180
 [<ffffffff804874ab>] ? error_exit+0x0/0x61
 [<ffffffff804874d8>] ? error_exit+0x2d/0x61
 [<ffffffff804874ab>] ? error_exit+0x0/0x61
 [<ffffffff8020e668>] ? xen_clocksource_read+0x38/0x60
 [<ffffffff80290cee>] ? sys_poll+0x2e/0x90
 [<ffffffff80211110>] ? sysret_check+0x1c/0x64
 [<ffffffff802110ea>] ? system_call_after_saveargs+0x38/0x3d
 [<ffffffff80487930>] ? xen_system_call_entry+0x0/0x35


Code: c3 0f 1f 40 00 e9 5b fc ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 65 48 8b 04
25 00 00 00 00 48 8b 80 18 06 00 00 c7 06 00 00 00 00 <83> 38 01 75 1f 48 8b 50
08 3b 3a 73 11 89 f8 48 c1 e0 03 48 03 
RIP  [<ffffffff80283796>] fget_light+0x16/0x90
 RSP <ffff88000ee7db80>
CR2: 0000000000000000
---[ end trace b484998d28b71338 ]---
Fixing recursive fault but reboot is needed!
------------[ cut here ]------------
kernel BUG at /usr/src/linux-2.6.25-gentoo-r1/kernel/exit.c:1022!
invalid opcode: 0000 [3] 
CPU 0 
Pid: 0, comm: swapper Tainted: G      D  2.6.25-gentoo-r1 #4
RIP: e030:[<ffffffff8022e73e>]  [<ffffffff8022e73e>] do_exit+0x51e/0x750
RSP: e02b:ffffffff8055dca8  EFLAGS: 00010246
RAX: ffffffff8055dfd8 RBX: ffff88000fc7c780 RCX: 0000000000000100
RDX: 0000000000000000 RSI: ffff88000fd5a780 RDI: ffffffff805a9068
RBP: 0000000000000020 R08: ffffffff8055c000 R09: 00000000ffffffff
R10: 0000000000000001 R11: ffffffff8044fab0 R12: ffff88000fc7c860
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  00007f230c0716f0(0000) GS:ffffffff8054f000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000000ec4c000 CR4: 0000000000000660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff8055c000, task ffffffff8051d320)
Stack:  0000000000000001 ffffffff8055dcc8 ffffffff805278c0 0000000000000000
 ffffffff8055dcc8 ffffffff8055dcc8 ffff880001388080 0000000000000000
 000000000000000b ffffffff8055ddd8 ffffffff8051d320 ffff88000fc7c780
Call Trace:
 [<ffffffff80211ffd>] ? oops_end+0x7d/0x80
 [<ffffffff8021340c>] ? do_invalid_op+0x8c/0xa0
 [<ffffffff804878d0>] ? xen_failsafe_callback+0x0/0x10
 [<ffffffff8023bd46>] ? alloc_pid+0x296/0x3d0
 [<ffffffff8027e71d>] ? kmem_cache_alloc+0x6d/0xd0
 [<ffffffff804874ab>] ? error_exit+0x0/0x61
 [<ffffffff804878d0>] ? xen_failsafe_callback+0x0/0x10
 [<ffffffff804878d0>] ? xen_failsafe_callback+0x0/0x10
 [<ffffffff8020e668>] ? xen_clocksource_read+0x38/0x60
 [<ffffffff804359c0>] ? udp_poll+0x0/0x100
 [<ffffffff8020c4b2>] ? xen_mc_flush+0x52/0x180
 [<ffffffff8020c0d6>] ? xen_load_tls+0x76/0xa0
 [<ffffffff8020f908>] ? __switch_to+0x88/0x370
 [<ffffffff80485d92>] ? thread_return+0x0/0x15e
 [<ffffffff8020c3c0>] ? xen_idle+0x0/0x50
 [<ffffffff802101a0>] ? cpu_idle+0x60/0x70


Code: 83 bb 08 07 00 00 00 74 05 e8 5f 2d 16 00 48 8b bb 60 07 00 00 48 85 ff 74
05 e8 fe 9c 05 00 48 c7 03 40 00 00 00 e8 b2 74 25 00 <0f> 0b eb fe 8b 8b 20 01
00 00 83 f9 ff 0f 85 27 ff ff ff 44 8b 
RIP  [<ffffffff8022e73e>] do_exit+0x51e/0x750
 RSP <ffffffff8055dca8>
---[ end trace b484998d28b71338 ]---
Kernel panic - not syncing: Attempted to kill the idle task!
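
A note on reading the dump above: in the second oops, the bytes marked in the Code: line are `<0f> 0b`, which is the x86 `ud2` opcode. That matches the "kernel BUG at ... exit.c:1022!" and "invalid opcode" lines, whereas the first oops (fget_light) faults on a different instruction with CR2 at 0. The following Python sketch pulls the marked bytes out of a Code: dump; the helper names are made up for illustration and are not from any existing tool.

```python
def faulting_bytes(code_dump, count=2):
    """Return `count` opcode bytes starting at the <..>-marked byte of a Code: dump.

    The dump may be wrapped over several lines; whitespace is ignored.
    Returns an empty list if no byte is marked.
    """
    tokens = code_dump.replace("Code:", "").split()
    for i, tok in enumerate(tokens):
        if tok.startswith("<") and tok.endswith(">"):
            return [tok[1:-1]] + tokens[i + 1:i + count]
    return []


def is_ud2(code_dump):
    """True if the marked instruction starts with 0f 0b, the x86 UD2 opcode."""
    return faulting_bytes(code_dump) == ["0f", "0b"]
```

Run on the two Code: dumps in this comment, is_ud2() is true only for the do_exit one, which makes it easier to tell a deliberate BUG() trap from the original fault.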
Comment 10 Jan Oravec 2008-06-16 07:01:37 EDT
I have tried xen-pvops-2.6.26-rc5-ehabkost2 on 2 VMs. They have been running
fine for more than 24 hours. Before, I saw crashes in less than 1 hour. I will
slowly start migrating less-important production VMs to test it further.

I am marking this bug as fixed.