653271 – RHEL5.5-32bit HVM guest with e1000 nic_model will get bug information after save/restore

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	xen
Sub Component:
Version:	5.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Miroslav Rezanina
QA Contact:	Virtualization Bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2010-11-15 06:19 UTC by YangGuang
Modified:	2012-09-11 07:32 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2011-08-16 12:53:56 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
configuration file of hvm guest (532 bytes, text/plain) 2010-11-15 06:20 UTC, YangGuang	no flags
xend.log (27.55 KB, text/plain) 2010-11-15 06:20 UTC, YangGuang	no flags
dmesg info (16.00 KB, text/plain) 2010-11-15 06:21 UTC, YangGuang	no flags
screenshot of guest (14.54 KB, image/png) 2010-11-15 06:21 UTC, YangGuang	no flags

Description YangGuang 2010-11-15 06:19:00 UTC

Description of problem:
I created a RHEL5.5 32-bit HVM guest with the e1000 nic_model. After a save/restore cycle, the guest console shows a BUG message (screenshot attached). With the rtl8139 nic_model, the problem does not occur.

Version-Release number of selected component (if applicable):
host: RHEL5.5-x86_64
      kernel-xen-2.6.18-231.el5
      xen-3.0.3-118.el5
guest:RHEL5.5-i386
      xen::nic_model: e1000
      

How reproducible:
100%
      
Steps to Reproduce:
1. Create a guest with e1000 nic_model
   [host]# xm cr $vm.cfg
2. Save this guest to a checkpoint file
   [host]# xm save ${DomU_ID} ${checkpoint_filename}
3. Restore this guest from the checkpoint file
   [host]# xm restore ${checkpoint_filename}
4. Use VNC to check the guest.
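
The steps above can be sketched as a small shell helper. This is only an illustration: the function names, the DRY_RUN switch, and the example arguments are assumptions and not part of the original report. xm save takes a domain name or ID, which is passed in explicitly here.

```shell
#!/bin/sh
# Sketch of steps 1-3. The helper names and DRY_RUN are hypothetical.

run() {
    # Print the command; execute it only when DRY_RUN is not 1.
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

save_restore_cycle() {
    cfg=$1          # guest configuration file
    domain=$2       # domain name or ID of the created guest
    checkpoint=$3   # checkpoint file written by xm save
    run xm create "$cfg" || return 1
    run xm save "$domain" "$checkpoint" || return 1
    run xm restore "$checkpoint" || return 1
}

# Example (placeholder arguments):
#   DRY_RUN=1; save_restore_cycle vm.cfg guest1 /tmp/guest1.chk
```

With DRY_RUN=1 the helper only prints the three xm commands, which is a convenient way to check the arguments before running the cycle against a real guest. Step 4 (checking the guest over VNC) remains manual.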

  
Actual results:
In step 4, the guest console shows a BUG message.

Expected results:
In step 4, the guest should be working normally.

Additional info:

Comment 1 YangGuang 2010-11-15 06:20:23 UTC

Created attachment 460480 [details]
configuration file of hvm guest

Comment 2 YangGuang 2010-11-15 06:20:55 UTC

Created attachment 460481 [details]
xend.log

Comment 3 YangGuang 2010-11-15 06:21:21 UTC

Created attachment 460482 [details]
dmesg info

Comment 4 YangGuang 2010-11-15 06:21:56 UTC

Created attachment 460483 [details]
screenshot of guest

Comment 5 Andrew Jones 2010-11-15 10:48:46 UTC

The description says this was tested with -118 userspace, but was that really the case? I see 

(XEN) irq.c:285: Dom2 callback via changed to PCI INTx Dev 0x03 IntA

in 'xm dmesg' and I don't believe that should be there with -118. Please confirm this problem exists with -118 (don't forget to restart xend after updating it). I think there's a chance that the following commit (in -118) could fix this

3d585ff xen: correct data-type in pyxc_set_hvm_param

Comment 6 YangGuang 2010-11-15 14:04:45 UTC

(In reply to comment #5)
> The description says this was tested with -118 userspace, but was that really
> the case? I see 
> 
> (XEN) irq.c:285: Dom2 callback via changed to PCI INTx Dev 0x03 IntA
> 
> in 'xm dmesg' and I don't believe that should be there with -118. Please
> confirm this problem exists with -118 (don't forget to restart xend after
> updating it). I think there's a chance that the following commit (in -118)
> could fix this
> 
> 3d585ff xen: correct data-type in pyxc_set_hvm_param

Yes, I have reproduced it again. When I did, I checked that xen is version -118 and restarted my machine.

Comment 7 Andrew Jones 2010-11-15 14:11:26 UTC

(In reply to comment #6)
> Yes, I have reproduced it again. When I did it, I checked xen which is version
> -118 and restart my machine.

Ok, thanks for the additional test. As an aside, I'll have to try and figure out if the callback via change is ok or not, but for this bug I guess we'll start with the backtrace.

Another important question though is whether or not this is a regression from 5.5 or 5.4, or if it just never worked.

Comment 8 YangGuang 2010-11-16 04:39:49 UTC

(In reply to comment #7)
> (In reply to comment #6)
> > Yes, I have reproduced it again. When I did it, I checked xen which is version
> > -118 and restart my machine.
> 
> Ok, thanks for the additional test. As an aside, I'll have to try and figure
> out if the callback via change is ok or not, but for this bug I guess we'll
> start with the backtrace.
> 
> Another important question though is whether or not this is a regression from
> 5.5 or 5.4, or if it just never worked.

I tried to reproduce it with xen-105 and kernel-xen-194, but there was no problem and the guest worked well.

Comment 9 YangGuang 2010-11-16 05:21:19 UTC

1). With xen-105 and kernel-xen-231, I CANNOT reproduce it.
2). With xen-118 and kernel-xen-194, the guest shows the BUG message and does not work.

Comment 11 Miroslav Rezanina 2010-11-16 09:23:29 UTC

This can't be reproduced with xen-105 because the e1000 model was not used when the guest was restored; the default rtl8139 was used instead. So it is not a xen regression but a problem in the e1000 driver.

Reassigning to kernel-xen.

Comment 13 Paolo Bonzini 2011-01-18 13:30:52 UTC

Miroslav, it should still be triaged by using a build of xen-105 + the patch for bug 574540.  I'll place the test build at http://people.redhat.com/pbonzini/bz653271 in ~1 hour.

Comment 14 Miroslav Rezanina 2011-01-18 13:52:57 UTC

(In reply to comment #13)
> Miroslav, it should still be triaged by using a build of xen-105 + the patch
> for bug 574540.  I'll place the test build at
> http://people.redhat.com/pbonzini/bz653271 in ~1 hour.

I ran this test. e1000 does not work after restore. If we save with e1000 and restore with rtl8139, it works.

Comment 16 Laszlo Ersek 2011-07-29 14:23:17 UTC

Possibly related to bug 723755.

Comment 17 Laszlo Ersek 2011-08-16 10:08:01 UTC

Host:
- x86_64 RHEL-5.7
- 2.6.18-278.el5xen
- xen-3.0.3-133.el5

Guest:
- i686 RHEL-5.7
- 2.6.18-274.el5 
- eth0: e1000

Confirming problem; after save/restore I've gotten the same dump as in attachment 460483 [details]:

BUG: soft lockup - CPU#3 stuck for 60s! [swapper:0]

Pid: 0, comm:              swapper
EIP: 0060:[<f09c7b71>] CPU: 3
EIP is at e1000_intr+0x9b/0x106 [e1000]
 EFLAGS: 00000293    Not tainted  (2.6.18-274.el5 #1)
EAX: ffffffff EBX: eccb1000 ECX: 00000286 EDX: 00000286
ESI: eccb1400 EDI: eccb1610 EBP: 00000005 DS: 007b ES: 007b
CR0: 8005003b CR2: b7f15000 CR3: 2d7ec000 CR4: 000006d0
 [<c045019d>] handle_IRQ_event+0x45/0x8c
 [<c0450297>] __do_IRQ+0xb3/0x104
 [<c04501e4>] __do_IRQ+0x0/0x104
 [<c04074d8>] do_IRQ+0x9b/0xc3
 [<c040597a>] common_interrupt+0x1a/0x20
 [<c042a72d>] __do_softirq+0x57/0x114
 [<c04073f9>] do_softirq+0x4e/0x92
 [<c04501e4>] __do_IRQ+0x0/0x104
 [<c04074f4>] do_IRQ+0xb7/0xc3
 [<c040597a>] common_interrupt+0x1a/0x20
 [<c0403c1c>] default_idle+0x0/0x59
 [<c0403c4d>] default_idle+0x31/0x59
 [<c0403d14>] cpu_idle+0x9f/0xb9
 =======================
hda: DMA interrupt recovery
hda: lost interrupt
hda: dma_timer_expiry: dma status == 0x24
hda: DMA interrupt recovery
hda: lost interrupt
hda: dma_timer_expiry: dma status == 0x24

INFO: task events/0:14 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
events/0      D 00000054  3440    14      1            15    13 (L-TLB)
       effeaef8 00000046 d0b59be8 00000054 eccb1000 c07a18e0 00000292 00000009 
       c1704550 d0b5a214 00000054 0000062c 00000000 c170465c c1606580 ee0843c0 
       00000003 00000000 c06bd4c0 c06bd504 c1704550 c1606580 c0620458 ffffffff 
Call Trace:
 [<c0620458>] schedule+0xbc/0xa57
 [<c0620ea5>] wait_for_completion+0x6b/0x8f
 [<c041f843>] default_wake_function+0x0/0xc
 [<c04351c0>] synchronize_rcu+0x2a/0x2f
 [<c0434e11>] wakeme_after_rcu+0x0/0x8
 [<c05d37bd>] dev_deactivate+0x76/0xa1
 [<c05cd2f1>] __linkwatch_run_queue+0x163/0x197
 [<c05cd342>] linkwatch_event+0x1d/0x22
 [<c0433dd1>] run_workqueue+0x81/0xc5
 [<c05cd325>] linkwatch_event+0x0/0x22
 [<c0434834>] worker_thread+0xd9/0x10d
 [<c041f843>] default_wake_function+0x0/0xc
 [<c043475b>] worker_thread+0x0/0x10d
 [<c0436c6e>] kthread+0xc0/0xee
 [<c0436bae>] kthread+0x0/0xee
 [<c0405c87>] kernel_thread_helper+0x7/0x10
 =======================
INFO: task irqbalance:2246 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
irqbalance    D 00000058  3172  2246      1          2264  2178 (NOTLB)
       ed0c9e98 00000086 b8a0d8d4 00000058 00000000 c0473b20 00000015 00000009 
       eee93aa0 b8a2ebb0 00000058 000212dc 00000001 eee93bac c160d3c4 eea31200 
       00000104 00000000 00000010 00000514 00000010 00100000 00000246 ffffffff 
Call Trace:
 [<c0473b20>] cache_alloc_refill+0xdd/0x48a
 [<c06217af>] __mutex_lock_slowpath+0x4d/0x7c
 [<c06217ed>] .text.lock.mutex+0xf/0x14
 [<c05c5c14>] dev_load+0x14/0x39
 [<c05baeb9>] sock_ioctl+0x0/0x1b3
 [<c05c686b>] dev_ioctl+0x2ed/0x462
 [<c04ca8dd>] inode_has_perm+0x54/0x5c
 [<c05bb04a>] sock_ioctl+0x191/0x1b3
 [<c05baeb9>] sock_ioctl+0x0/0x1b3
 [<c0487cb9>] do_ioctl+0x1c/0x5d
 [<c048824d>] vfs_ioctl+0x47b/0x4d3
 [<c04882ed>] sys_ioctl+0x48/0x5f
 [<c0404f4b>] syscall_call+0x7/0xb
 =======================

And the guest becomes unusable.

Changing the eth0 device model to rtl8139, the guest seems to work okay after restore.
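
A quick way to check a captured guest log for this signature is sketched below. The function name and the idea of having the guest console or dmesg output saved to a file are assumptions for illustration; the two patterns are taken from the dump above.

```shell
# Return 0 if the log file contains both a soft lockup report and an
# EIP located in e1000_intr; return non-zero otherwise.
has_e1000_lockup() {
    log=$1
    grep -q 'BUG: soft lockup' "$log" &&
        grep -q 'EIP is at e1000_intr' "$log"
}

# Example (hypothetical file name):
#   has_e1000_lockup guest-console.log && echo "e1000 lockup seen"
```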

Comment 18 Laszlo Ersek 2011-08-16 12:53:56 UTC

This bug could be caused by the e1000 driver in the guest kernel (and then the component should be set to "kernel"), or in the e1000 emulation of qemu-dm (and then the component should be set to "xen"). Since the same bare-metal kernel seems to be working with real e1000 cards, I'm resetting the component to "xen".

I tried to look at how e1000 save/restore works in upstream qemu-dm.

In the e1000 save/restore functions, the arrays listing registers-to-be-saved were first unfolded [1] [2], then the save functionality was ported to "vmstate" [3]. The last version of the e1000 emulation before [1] is 7c131dd5. Running "git blame hw/e1000.c" on that version, and looking at nic_save(), we arrive at [4], which removed mmio_base, which is one (but surely not the only) proof that the e1000 save logic saw real changes after the RHEL-5 version was forked.

In addition to resetting the component to xen, I'm closing this as WONTFIX. The patch under [4] is huge and there's no guarantee it would fix our problem.

e1000 support was added to RHEL-5 qemu-dm in response to bug 344861 -- to enable Windows 2003 Server guests on IA64 hosts (xen-3.0.3-72.el5). A further change was made for bug 574540 in xen-3.0.3-107.el5 so that the e1000 setting survives a reboot from within guest. (See KB article 27773 linked from bug 574540: the article states in the Environment section that the fix is targeted at IA64 hosts and Win2K3 guests. See also bug 574540 comment 4.)

It was verified in comment 14 that restoring from e1000 never worked (this fact may have been masked by bug 574540 for a while).

After installation & reboot, the Win2K3 guest can set up the xenpv-win drivers (the xenpv-win README lists "Windows(R) 2003 Server Service Pack 2 (x86 and x64)" as supported).

This bug was reported internally. Since the customer-facing bug 574540 was opened 1.5 years ago, no customer interest appears to have been expressed in this bug, i.e. in save/restore/migration with e1000. I think it is not worth the effort and risks for 5.8.

Feel free to reopen / revert my changes if I'm wrong.

[1] http://xenbits.xen.org/gitweb/?p=people/aperard/qemu-dm.git;a=commitdiff;h=2e885049
[2] http://xenbits.xen.org/gitweb/?p=people/aperard/qemu-dm.git;a=commitdiff;h=28366c3a
[3] http://xenbits.xen.org/gitweb/?p=people/aperard/qemu-dm.git;a=commitdiff;h=e482dc3e
[4] http://xenbits.xen.org/gitweb/?p=people/aperard/qemu-dm.git;a=commitdiff;h=8da3ff18
