Description of problem:
I created a RHEL5.5 32-bit HVM guest with the e1000 nic_model. After a save/restore cycle, a BUG message is printed in the guest (screenshot attached). With the rtl8139 nic_model there is no problem.

Version-Release number of selected component (if applicable):
host: RHEL5.5-x86_64
kernel-xen-2.6.18-231.el5
xen-3.0.3-118.el5
guest: RHEL5.5-i386
xen::nic_model: e1000

How reproducible:
100%

Steps to Reproduce:
1. Create a guest with e1000 nic_model
[host]# xm cr $vm.cfg
2. Save this guest to a checkpoint file
[host]# xm save ${DomU_ID} ${checkpoint_filename}
3. Restore this guest from the checkpoint file
[host]# xm restore ${checkpoint_filename}
4. Use vnc to check the guest.

Actual results:
After step 4, a BUG message is shown in the guest.

Expected results:
After step 4, the guest should work normally.

Additional info:
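The steps above can be scripted for repeated runs. The following is a minimal sketch and not part of the original report: the function name `save_restore_cycle` and the `XM` override variable are assumptions made for illustration, while `xm create` (written `xm cr` in the steps), `xm save` and `xm restore` are the commands from the steps.

```shell
#!/bin/sh
# Sketch of one save/restore cycle for an HVM guest.
# XM is a hypothetical override so the sequence can be dry-run (XM=echo).
XM="${XM:-xm}"

# usage: save_restore_cycle <config file> <domain name or ID> <checkpoint file>
save_restore_cycle() {
    cfg="$1"
    domain="$2"
    checkpoint="$3"

    "$XM" create "$cfg" || return 1                 # step 1: create the guest
    "$XM" save "$domain" "$checkpoint" || return 1  # step 2: save to a checkpoint file
    "$XM" restore "$checkpoint" || return 1         # step 3: restore from the checkpoint
    # step 4 (checking the guest over VNC) remains a manual check
}
```

With `XM=echo` set, the function only prints the three commands it would run, which is a cheap way to check the arguments before touching a real guest.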
Created attachment 460480 [details] configuration file of hvm guest
Created attachment 460481 [details] xend.log
Created attachment 460482 [details] dmesg info
Created attachment 460483 [details] screenshot of guest
The description says this was tested with -118 userspace, but was that really the case? I see

(XEN) irq.c:285: Dom2 callback via changed to PCI INTx Dev 0x03 IntA

in 'xm dmesg' and I don't believe that should be there with -118. Please confirm this problem exists with -118 (don't forget to restart xend after updating it). I think there's a chance that the following commit (in -118) could fix this

3d585ff xen: correct data-type in pyxc_set_hvm_param
(In reply to comment #5)
> The description says this was tested with -118 userspace, but was that really
> the case? I see
>
> (XEN) irq.c:285: Dom2 callback via changed to PCI INTx Dev 0x03 IntA
>
> in 'xm dmesg' and I don't believe that should be there with -118. Please
> confirm this problem exists with -118 (don't forget to restart xend after
> updating it). I think there's a chance that the following commit (in -118)
> could fix this
>
> 3d585ff xen: correct data-type in pyxc_set_hvm_param

Yes, I have reproduced it again. For this run I checked that xen is at version -118 and restarted my machine.
(In reply to comment #6)
> Yes, I have reproduced it again. When I did it, I checked xen which is version
> -118 and restart my machine.

Ok, thanks for the additional test. As an aside, I'll have to try and figure out if the callback via change is ok or not, but for this bug I guess we'll start with the backtrace.

Another important question though is whether or not this is a regression from 5.5 or 5.4, or if it just never worked.
(In reply to comment #7)
> (In reply to comment #6)
> > Yes, I have reproduced it again. When I did it, I checked xen which is version
> > -118 and restart my machine.
>
> Ok, thanks for the additional test. As an aside, I'll have to try and figure
> out if the callback via change is ok or not, but for this bug I guess we'll
> start with the backtrace.
>
> Another important question though is whether or not this is a regression from
> 5.5 or 5.4, or if it just never worked.

I tried to reproduce it with xen-105 and kernel-xen-194, but there was no problem and the guest worked well.
1) I cannot reproduce it with xen-105 and kernel-xen-231.
2) With xen-118 and kernel-xen-194, the guest prints the BUG message and does not work.
This can't be reproduced with xen-105 because the e1000 model was not used when the guest was restored; the default rtl8139 was used instead. So it is not a xen regression but a problem in the e1000 driver. Reassigning to kernel-xen.
Miroslav, it should still be triaged by using a build of xen-105 + the patch for bug 574540. I'll place the test build at http://people.redhat.com/pbonzini/bz653271 in ~1 hour.
(In reply to comment #13)
> Miroslav, it should still be triaged by using a build of xen-105 + the patch
> for bug 574540. I'll place the test build at
> http://people.redhat.com/pbonzini/bz653271 in ~1 hour.

I did this testing. e1000 does not work after restore. If we save with e1000 and restore with rtl8139 it works.
Possibly related to bug 723755.
Host:
- x86_64 RHEL-5.7
- 2.6.18-278.el5xen
- xen-3.0.3-133.el5

Guest:
- i686 RHEL-5.7
- 2.6.18-274.el5
- eth0: e1000

Confirming problem; after save/restore I've gotten the same dump as in attachment 460483 [details]:

BUG: soft lockup - CPU#3 stuck for 60s! [swapper:0]

Pid: 0, comm: swapper
EIP: 0060:[<f09c7b71>] CPU: 3
EIP is at e1000_intr+0x9b/0x106 [e1000]
 EFLAGS: 00000293 Not tainted (2.6.18-274.el5 #1)
EAX: ffffffff EBX: eccb1000 ECX: 00000286 EDX: 00000286
ESI: eccb1400 EDI: eccb1610 EBP: 00000005 DS: 007b ES: 007b
CR0: 8005003b CR2: b7f15000 CR3: 2d7ec000 CR4: 000006d0
 [<c045019d>] handle_IRQ_event+0x45/0x8c
 [<c0450297>] __do_IRQ+0xb3/0x104
 [<c04501e4>] __do_IRQ+0x0/0x104
 [<c04074d8>] do_IRQ+0x9b/0xc3
 [<c040597a>] common_interrupt+0x1a/0x20
 [<c042a72d>] __do_softirq+0x57/0x114
 [<c04073f9>] do_softirq+0x4e/0x92
 [<c04501e4>] __do_IRQ+0x0/0x104
 [<c04074f4>] do_IRQ+0xb7/0xc3
 [<c040597a>] common_interrupt+0x1a/0x20
 [<c0403c1c>] default_idle+0x0/0x59
 [<c0403c4d>] default_idle+0x31/0x59
 [<c0403d14>] cpu_idle+0x9f/0xb9
 =======================
hda: DMA interrupt recovery
hda: lost interrupt
hda: dma_timer_expiry: dma status == 0x24
hda: DMA interrupt recovery
hda: lost interrupt
hda: dma_timer_expiry: dma status == 0x24
INFO: task events/0:14 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
events/0 D 00000054 3440 14 1 15 13 (L-TLB)
 effeaef8 00000046 d0b59be8 00000054 eccb1000 c07a18e0 00000292 00000009
 c1704550 d0b5a214 00000054 0000062c 00000000 c170465c c1606580 ee0843c0
 00000003 00000000 c06bd4c0 c06bd504 c1704550 c1606580 c0620458 ffffffff
Call Trace:
 [<c0620458>] schedule+0xbc/0xa57
 [<c0620ea5>] wait_for_completion+0x6b/0x8f
 [<c041f843>] default_wake_function+0x0/0xc
 [<c04351c0>] synchronize_rcu+0x2a/0x2f
 [<c0434e11>] wakeme_after_rcu+0x0/0x8
 [<c05d37bd>] dev_deactivate+0x76/0xa1
 [<c05cd2f1>] __linkwatch_run_queue+0x163/0x197
 [<c05cd342>] linkwatch_event+0x1d/0x22
 [<c0433dd1>] run_workqueue+0x81/0xc5
 [<c05cd325>] linkwatch_event+0x0/0x22
 [<c0434834>] worker_thread+0xd9/0x10d
 [<c041f843>] default_wake_function+0x0/0xc
 [<c043475b>] worker_thread+0x0/0x10d
 [<c0436c6e>] kthread+0xc0/0xee
 [<c0436bae>] kthread+0x0/0xee
 [<c0405c87>] kernel_thread_helper+0x7/0x10
 =======================
INFO: task irqbalance:2246 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
irqbalance D 00000058 3172 2246 1 2264 2178 (NOTLB)
 ed0c9e98 00000086 b8a0d8d4 00000058 00000000 c0473b20 00000015 00000009
 eee93aa0 b8a2ebb0 00000058 000212dc 00000001 eee93bac c160d3c4 eea31200
 00000104 00000000 00000010 00000514 00000010 00100000 00000246 ffffffff
Call Trace:
 [<c0473b20>] cache_alloc_refill+0xdd/0x48a
 [<c06217af>] __mutex_lock_slowpath+0x4d/0x7c
 [<c06217ed>] .text.lock.mutex+0xf/0x14
 [<c05c5c14>] dev_load+0x14/0x39
 [<c05baeb9>] sock_ioctl+0x0/0x1b3
 [<c05c686b>] dev_ioctl+0x2ed/0x462
 [<c04ca8dd>] inode_has_perm+0x54/0x5c
 [<c05bb04a>] sock_ioctl+0x191/0x1b3
 [<c05baeb9>] sock_ioctl+0x0/0x1b3
 [<c0487cb9>] do_ioctl+0x1c/0x5d
 [<c048824d>] vfs_ioctl+0x47b/0x4d3
 [<c04882ed>] sys_ioctl+0x48/0x5f
 [<c0404f4b>] syscall_call+0x7/0xb
 =======================

And the guest becomes unusable. Changing the eth0 device model to rtl8139, the guest seems to work okay after restore.
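When comparing runs across nic models, it can help to check a captured guest console log for the same signature automatically. The following is a rough sketch, assuming the console output has been saved to a file with one kernel message per line; the function names are hypothetical, and the patterns only cover the i386 "EIP is at" form seen in the dump above.

```shell
#!/bin/sh
# usage: count_soft_lockups <console log>
# Prints how many "BUG: soft lockup" reports the log contains.
count_soft_lockups() {
    grep -c 'BUG: soft lockup' "$1"
}

# usage: lockup_functions <console log>
# Prints the distinct function names taken from "EIP is at <func>+<off>/<len>" lines.
lockup_functions() {
    sed -n 's/^ *EIP is at \([^+ ]*\)+.*/\1/p' "$1" | sort -u
}
```

For the dump above, `lockup_functions` would report `e1000_intr`, the function the CPU was executing when the soft lockup was detected.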
This bug could be caused by the e1000 driver in the guest kernel (and then the component should be set to "kernel"), or by the e1000 emulation in qemu-dm (and then the component should be set to "xen"). Since the same bare-metal kernel seems to be working with real e1000 cards, I'm resetting the component to "xen".

I tried to look at how e1000 save/restore works in upstream qemu-dm. In the e1000 save/restore functions, the arrays listing registers-to-be-saved were first unfolded [1] [2], then the save functionality was ported to "vmstate" [3]. The last version of the e1000 emulation before [1] is 7c131dd5. Running "git blame hw/e1000.c" on that version, and looking at nic_save(), we arrive at [4], which removed mmio_base. That is one (but surely not the only) proof that the e1000 save logic saw real changes after the RHEL-5 version was forked.

In addition to resetting the component to xen, I'm closing this as WONTFIX. The patch under [4] is huge and there's no guarantee it would fix our problem.

e1000 support was added to RHEL-5 qemu-dm in response to bug 344861 -- to enable Windows 2003 Server guests on IA64 hosts (xen-3.0.3-72.el5). A further change was made for bug 574540 in xen-3.0.3-107.el5 so that the e1000 setting survives a reboot from within the guest. (See KB article 27773 linked from bug 574540: the article states in the Environment section that the fix is targeted at IA64 hosts and Win2K3 guests. See also bug 574540 comment 4.)

It was verified in comment 14 that restoring from e1000 never worked (this fact may have been masked by bug 574540 for a while).

After installation & reboot, the Win2K3 guest can set up the xenpv-win drivers (the xenpv-win README lists "Windows(R) 2003 Server Service Pack 2 (x86 and x64)" as supported).

This bug was reported internally. Since the customer-facing bug 574540 was opened 1.5 years ago, no customer appears to have expressed interest in this bug, i.e. in save/restore/migration with e1000.

I think it is not worth the effort and risks for 5.8. Feel free to reopen / revert my changes if I'm wrong.

[1] http://xenbits.xen.org/gitweb/?p=people/aperard/qemu-dm.git;a=commitdiff;h=2e885049
[2] http://xenbits.xen.org/gitweb/?p=people/aperard/qemu-dm.git;a=commitdiff;h=28366c3a
[3] http://xenbits.xen.org/gitweb/?p=people/aperard/qemu-dm.git;a=commitdiff;h=e482dc3e
[4] http://xenbits.xen.org/gitweb/?p=people/aperard/qemu-dm.git;a=commitdiff;h=8da3ff18