Bug 808391

Summary: Guest does not work well and hits memory problems after repeated S3/wakeup cycles
Product: Red Hat Enterprise Linux 6
Reporter: Qunfang Zhang <qzhang>
Component: qemu-kvm
Assignee: Marcelo Tosatti <mtosatti>
Status: CLOSED WONTFIX
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Priority: high
Version: 6.3
CC: acathrow, amit.shah, areis, bsarathy, jburke, jogreene, juzhang, knoel, michen, mkenneth, mtosatti, qiguo, qzhang, rhod, shuang, sluo, virt-bugs, virt-maint
Keywords: Reopened
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Clones: 968167 1105682
Last Closed: 2014-06-05 22:15:31 UTC
Bug Blocks: 761491, 912287, 1105682

Attachments:

Description	Flags
isa log for the s3 problem.	none
dmesg from serial console when guest is in hang state.	none
log_of_after_resume_from_S4	none

Description Qunfang Zhang 2012-03-30 10:24:25 UTC

Description of problem:
After suspending a guest to memory (S3) and waking it up, the guest fails to resume successfully, and the serial console shows many call traces and memory-related errors.
(The isa-serial log will be uploaded.)

Tested on a physical host 30 times and did not hit this issue.
kvmclock was not used.
The issue occurs both with and without virtio devices.
When using virtio block, I used the kernel Amit provided in https://bugzilla.redhat.com/show_bug.cgi?id=803187#c10, which fixes the S3 & virtio issue.

Version-Release number of selected component (if applicable):
Host:
kernel-2.6.32-251.el6.x86_64
qemu-kvm-0.12.1.2-2.265.el6.x86_64

Guest:
2.6.32-251.el6.x86_64
kernel-2.6.32-254.el6bz803187 (https://bugzilla.redhat.com/show_bug.cgi?id=803187#c10)

How reproducible:
1/20.
I actually hit it 3 times today; the last occurrence was during a run of 20 repeated suspend/wakeup cycles.

Steps to Reproduce:
1. Boot a rhel6.3 guest with the kernel version described above.
/usr/libexec/qemu-kvm -M rhel6.3.0 -cpu Conroe,-kvmclock -enable-kvm -m 2G -smp 2,sockets=1,cores=2,threads=1 -name rhel6.3 -uuid 4c84db67-faf8-4498-9829-19a3d6431d9d -rtc base=localtime,driftfix=slew -drive file=/home/rhel6.3-64.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none,werror=stop,rerror=stop,aio=native -device virtio-blk-pci,bus=pci.0,drive=drive-virtio-disk0,id=virtio-disk0,addr=0x5 -netdev tap,id=hostnet0,vhost=on,script=/etc/qemu-ifup -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:2a:42:10:66,bus=pci.0,addr=0x3 -usb -device usb-tablet,id=input0 -boot c -monitor stdio  -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -vnc :10  -qmp tcp:0:4444,server,nowait -chardev socket,path=/tmp/qzhang-test,server,nowait,id=isa1 -device isa-serial,chardev=isa1,id=isa-serial1 -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x4 -chardev socket,path=/tmp/qga.sock,server,nowait,id=qga0 -device virtserialport,chardev=qga0,name=org.qemu.guest_agent.0,bus=virtio-serial0.0 -device virtio-balloon-pci,bus=pci.0,id=balloon0

2. Connect to isa-serial on host to catch the log:
(on host) #nc -U /tmp/qzhang-test

3. Suspend guest to memory:
#pm-suspend

4. Wake up the guest either by clicking the PS/2 mouse or keyboard, or by sending the "system_wakeup" command.

5. Repeat steps 3-4 more than 20 times.
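
The suspend/wakeup cycle in steps 3-5 could be driven from the host instead of by hand. The sketch below is only an illustration under assumptions that this report does not confirm: that the QMP socket from step 1 (tcp port 4444) accepts the system_wakeup command, and that the guest agent listening on /tmp/qga.sock implements guest-suspend-ram. All function names are hypothetical.

```python
import json
import socket


def qmp_message(command, arguments=None):
    """Encode a single QMP or guest-agent command as one JSON line (bytes)."""
    msg = {"execute": command}
    if arguments:
        msg["arguments"] = arguments
    return (json.dumps(msg) + "\n").encode("ascii")


def read_reply(stream):
    """Skip asynchronous events until a 'return' or 'error' reply arrives."""
    while True:
        line = stream.readline()
        if not line:
            raise ConnectionError("QMP connection closed")
        reply = json.loads(line)
        if "return" in reply or "error" in reply:
            return reply


def qmp_wakeup(host="127.0.0.1", port=4444):
    """Send system_wakeup over the socket opened with -qmp tcp:0:4444."""
    with socket.create_connection((host, port), timeout=10) as sock:
        stream = sock.makefile("rwb")
        stream.readline()  # QMP greeting banner
        stream.write(qmp_message("qmp_capabilities"))
        stream.flush()
        read_reply(stream)
        stream.write(qmp_message("system_wakeup"))
        stream.flush()
        return read_reply(stream)


def agent_suspend_ram(path="/tmp/qga.sock"):
    """Ask the guest agent to enter S3 (assumes guest-suspend-ram exists)."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        sock.sendall(qmp_message("guest-suspend-ram"))
```

A driver loop would call agent_suspend_ram(), wait a few seconds, call qmp_wakeup(), and then inspect the isa-serial log before the next iteration, repeating more than 20 times as in step 5.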
  
Actual results:
The guest cannot wake up and return to the desktop; the isa-serial log shows many call traces and errors.

Expected results:
The guest should suspend to memory and wake up successfully every time.

Additional info:

Comment 1 Qunfang Zhang 2012-03-30 10:25:10 UTC

Created attachment 573944 [details]
isa log for the s3 problem.

Comment 2 Qunfang Zhang 2012-03-30 10:31:08 UTC

Repeatedly suspending the guest to S3 and waking it up reproduces this issue easily.

Comment 4 Qunfang Zhang 2012-03-31 08:20:28 UTC

Re-tested with the following packages; the issue can still be reproduced, even on the first attempt:
kernel: 
kernel-2.6.32-244.el6.x86_64
kernel-2.6.32-254.el6.x86_64
kernel-2.6.32-259.el6.x86_64

qemu-kvm:
qemu-kvm-0.12.1.2-2.246.el6.x86_64
qemu-kvm-0.12.1.2-2.265.el6.x86_64

seabios:
seabios-0.6.1.2-13.el6.x86_64
seabios-0.6.1.2-16.el6.x86_64

Comment 5 Qunfang Zhang 2012-04-05 09:23:09 UTC

I raised the severity and priority because this is very easy to reproduce. Sometimes it happens on the first S3/resume. Sometimes it does not occur within 3-4 attempts, but rebooting the guest at that point may then trigger the problem.

Comment 7 Qunfang Zhang 2012-04-05 10:29:21 UTC

1. It happens with and without virtio devices.
2. It happens with and without kvmclock.
3. After I run "pm-suspend" or "echo mem > /sys/power/state", the guest suspends to memory and displays a black screen; at this point no messages appear in the serial log. When I then wake up the guest, I hit the problem.
4. It is easy to reproduce with a 64-bit RHEL 6.3 guest. I tested a 32-bit RHEL 6.3 guest about 20 times and could not reproduce it.

Comment 8 Amit Shah 2012-04-05 10:42:14 UTC

Thanks for that info.  Some more questions:

1. Does this still happen if you increase guest RAM size?
2. Does it happen without X in guest (booting into init 3), with lower ram, like 1G?
3. How much swap space does the guest have?  How much is the usage of ram and swap before going into s3?  For a successful wakeup, how much swap and ram usage does guest report after wakeup ('free' output)?

Comment 9 Qunfang Zhang 2012-04-05 11:16:10 UTC

(In reply to comment #8)
> Thanks for that info.  Some more questions:
> 
> 1. Does this still happen if you increase guest RAM size?
After increasing guest RAM from 2G to 4G, the issue still reproduces. RAM usage after hitting the problem:

[root@localhost ~]# free -m
             total       used       free     shared    buffers     cached
Mem:          3831        604       3226          0         22        300
-/+ buffers/cache:        281       3550
Swap:         4031          0       4031
[root@localhost ~]# 

> 2. Does it happen without X in guest (booting into init 3), with lower ram,
> like 1G?

It still reproduces. Guest RAM usage after hitting the problem:
[root@localhost ~]# free -m
             total       used       free     shared    buffers     cached
Mem:           996        277        719          0         20        102
-/+ buffers/cache:        153        842
Swap:         4031          0       4031
[root@localhost ~]# 


> 3. How much swap space does the guest have?  How much is the usage of ram and
> swap before going into s3?  For a successful wakeup, how much swap and ram
> usage does guest report after wakeup ('free' output)?

before the issue:

[root@localhost ~]# free -m
             total       used       free     shared    buffers     cached
Mem:          1877        494       1383          0         28        206
-/+ buffers/cache:        258       1619
Swap:         4031          0       4031
[root@localhost ~]# 


after the issue:
             total       used       free     shared    buffers     cached
Mem:          1877        442       1434          0         22        295
-/+ buffers/cache:        124       1752
Swap:         4031          0       4031
[root@localhost ~]# 


a successful wakeup:

[root@localhost ~]#  free -m
             total       used       free     shared    buffers     cached
Mem:          1877        495       1382          0         29        206
-/+ buffers/cache:        259       1618
Swap:         4031          0       4031


Additional info:
I tried the rhel6.2 kernel-220 and did not reproduce the issue in 10 attempts. Then I upgraded the kernel to -259 and reproduced it on the 3rd attempt.
I will try more kernel versions tomorrow and update the results here.
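
The `free -m` snapshots in this comment can be compared mechanically. A minimal sketch, assuming the six-column layout shown above; parse_free and free_fraction are illustrative names, not part of any tool used in this report.

```python
def parse_free(output):
    """Return {'Mem': {...}, 'Swap': {...}} parsed from `free -m` text."""
    columns = None
    result = {}
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "total":
            # Header row: total used free shared buffers cached
            columns = fields
        elif fields[0] in ("Mem:", "Swap:") and columns:
            values = [int(v) for v in fields[1:]]
            # The Swap row has only three values; zip() stops at the shorter list.
            result[fields[0].rstrip(":")] = dict(zip(columns, values))
    return result


def free_fraction(snapshot):
    """Fraction of RAM reported as free in one parsed snapshot."""
    mem = snapshot["Mem"]
    return mem["free"] / mem["total"]


# Snapshot taken after the issue, copied from this comment.
AFTER_ISSUE = """\
             total       used       free     shared    buffers     cached
Mem:          1877        442       1434          0         22        295
-/+ buffers/cache:        124       1752
Swap:         4031          0       4031
"""
```

Applied to the "after the issue" snapshot, this shows most RAM and all swap still free, which is what makes the memory errors in the serial log surprising.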

Comment 10 Amit Shah 2012-04-05 11:57:41 UTC

Does disabling thp in the guest before entering s3 help?

Also, can you check by entering 'init 2' or issuing 'service network stop' before entering s3 to see if it helps?

Comment 11 Qunfang Zhang 2012-04-06 08:41:59 UTC

(In reply to comment #10)
> Does disabling thp in the guest before entering s3 help?
It did not help. I disabled thp and then did s3/wakeup; the issue reproduced on the first attempt.

> 
> Also, can you check by entering 'init 2' or issuing 'service network stop'
> before entering s3 to see if it helps?
'init 2' helps. After entering 'init 2' and running many s3/wakeup cycles, I did not hit the problem in more than 20 attempts.

Comment 12 Qunfang Zhang 2012-04-06 09:30:51 UTC

Update:
1. Booting the guest with '-net none': cannot reproduce after 20 attempts.
2. Booting the guest with an e1000 nic but with the NetworkManager service stopped inside the guest: cannot reproduce after 20 attempts.
3. Starting the 'NetworkManager' service again: easy to reproduce, especially with the sequence reboot guest -> start NetworkManager service -> pm-suspend.

Comment 13 Amit Shah 2012-04-06 16:45:39 UTC

So: the system has enough free RAM and almost all swap space free, but the OOM killer is still invoked.

The attached logs show that it's the nm-applet process that causes the OOM killer to activate:

Restarting tasks ... done.
e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
ADDRCONF(NETDEV_UP): eth0: link is not ready
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
nm-applet invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
nm-applet cpuset=/ mems_allowed=0
Pid: 2648, comm: nm-applet Not tainted 2.6.32-254.el6bz803187.x86_64 #1


This has been reproduced with e1000 as well as virtio-net devices.

The OOM killer gets invoked several times in the same run:

Out of memory: Kill process 1495 (rsyslogd) score 1 or sacrifice child
Killed process 1495, UID 0, (rsyslogd) total-vm:249072kB, anon-rss:544kB, file-rss:1076kB
nm-applet invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
nm-applet cpuset=/ mems_allowed=0
Pid: 2648, comm: nm-applet Not tainted 2.6.32-254.el6bz803187.x86_64 #1


Out of memory: Kill process 1543 (rpcbind) score 1 or sacrifice child
Killed process 1543, UID 32, (rpcbind) total-vm:18968kB, anon-rss:248kB, file-rss:688kB
nm-applet invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
nm-applet cpuset=/ mems_allowed=0
Pid: 2648, comm: nm-applet Not tainted 2.6.32-254.el6bz803187.x86_64 #1


The OOM killer keeps killing tasks till it finally lands upon the nm-applet task:
Out of memory: Kill process 2648 (nm-applet) score 1 or sacrifice child
Killed process 2648, UID 0, (nm-applet) total-vm:310212kB, anon-rss:3504kB, file-rss:9128kB
swap_free: Bad swap offset entry 00800000
BUG: Bad page map in process nm-applet  pte:100000000 pmd:37edf067
addr:00007f2c12801000 vm_flags:08000071 anon_vma:(null) mapping:ffff88007bc6cde0 index:7a
vma->vm_ops->fault: filemap_fault+0x0/0x500
vma->vm_file->f_op->mmap: ext4_file_mmap+0x0/0x60 [ext4]

... and from then on, this keeps repeating.
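
The kernel messages quoted above follow a regular pattern, so the attached serial logs could be summarized with a small script. This is a hedged sketch rather than a tool used in this report: the regular expressions are inferred only from the lines quoted in this comment, and summarize_oom is a hypothetical name.

```python
import re

# Patterns inferred from the log excerpts quoted in this comment.
OOM_INVOKED = re.compile(r"^(?P<comm>\S+) invoked oom-killer:")
OOM_KILLED = re.compile(r"^Killed process (?P<pid>\d+), UID \d+, \((?P<comm>[^)]+)\)")
CORRUPTION_PREFIXES = ("BUG: Bad page map", "swap_free: Bad swap offset")


def summarize_oom(log_text):
    """Count OOM invokers, list killed tasks, and count corruption messages."""
    invokers = {}
    victims = []
    corruption = 0
    for raw in log_text.splitlines():
        line = raw.strip()
        match = OOM_INVOKED.match(line)
        if match:
            comm = match.group("comm")
            invokers[comm] = invokers.get(comm, 0) + 1
            continue
        match = OOM_KILLED.match(line)
        if match:
            victims.append((int(match.group("pid")), match.group("comm")))
            continue
        if line.startswith(CORRUPTION_PREFIXES):
            corruption += 1
    return invokers, victims, corruption
```

Running it over a whole isa-serial log would show at a glance whether one task (here nm-applet) keeps triggering the OOM killer and at which point the page-table corruption messages begin.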

Comment 14 Marcelo Tosatti 2012-04-09 00:30:22 UTC

Qunfang,

Is this an Intel host? If so, can you please reload the kvm_intel module
with

rmmod kvm_intel
modprobe kvm_intel enable_ept=0

And retry the test?

Comment 15 Qunfang Zhang 2012-04-09 02:35:37 UTC

(In reply to comment #14)
> Qunfang,
> 
> Is this an Intel host? If so, can you please reload the kvm_intel module
> with
> 
> rmmod kvm_intel
> modprobe kvm_intel enable_ept=0
> 
> And retry the test?

Hi, Marcelo
Yes, this is an Intel host and it does not support ept.

CPU info:

processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 23
model name	: Intel(R) Core(TM)2 Quad CPU    Q9550  @ 2.83GHz
stepping	: 10
cpu MHz		: 2826.254
cache size	: 6144 KB
physical id	: 0
siblings	: 4
core id		: 3
cpu cores	: 4
apicid		: 3
initial apicid	: 3
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dts tpr_shadow vnmi flexpriority
bogomips	: 5652.50
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:
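
Whether a host offers hardware-assisted paging can be read from the same /proc/cpuinfo output. A small sketch, assuming the kernel lists an `ept` flag (Intel) or `npt` flag (AMD) when the feature is present; cpu_flags and describe_virt are illustrative names.

```python
def cpu_flags(cpuinfo_text):
    """Return the flag set from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def describe_virt(flags):
    """Report which virtualization-related flags are present."""
    return {name: name in flags for name in ("vmx", "svm", "ept", "npt")}
```

For the flags line pasted above, this reports vmx as present and ept as absent, consistent with the statement that this host does not support ept.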

Comment 16 Qunfang Zhang 2012-04-11 03:13:59 UTC

Update:
Re-tested on the following hosts with kernel-262 installed on both host and guest:

1. Intel host with ept supported and enabled. (Hard to reproduce; I hit it only once.)
2. Intel host without ept support. (Easy to reproduce.)
3. AMD host with npt supported and enabled.
4. AMD host with npt supported but disabled.

Comment 17 Qunfang Zhang 2012-04-11 03:16:05 UTC

(In reply to comment #16)
> Update:
> Re-test on the following hosts with kernel-262 installed on both host and
> guest:
> 
> 1. Intel host with ept supported and ept is enabled. (hard to reproduce, I only
> hit once.)
> 2. Intel host without ept supported. (easily to reproduce.)
> 2. AMD host with npt supported and npt is enabled.
Reproduced.

> 3. AMD host with npt supported but npt is disabled.
Reproduced.

Comment 21 RHEL Program Management 2012-07-10 07:17:17 UTC

This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 22 RHEL Program Management 2012-07-11 02:04:35 UTC

This request was erroneously removed from consideration in Red Hat Enterprise Linux 6.4, which is currently under development.  This request will be evaluated for inclusion in Red Hat Enterprise Linux 6.4.

Comment 23 Marcelo Tosatti 2012-08-08 20:33:52 UTC

Qunfang Zhang,

I cannot reproduce this problem (after 100+ S3 suspend/resume
cycles).

This is the qemu command line:

/usr/libexec/qemu-kvm -name Intel_w35202_8G4smp -monitor telnet::4445,server,nowait \
-net nic,model=e1000 -net tap,script=/root/ifup.sh \
-m 2000 \
-vnc :3 -rtc base=localtime,driftfix=slew -boot \
order=cdn,menu=off -usbdevice tablet -enable-kvm \
-drive file=rhel63.img,index=0,if=none,id=drive-ide0-0-0,media=disk,cache=writeback,format=qcow2,aio=native -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -serial stdio -smp 2 -cpu qemu64,-pat -bios /usr/share/seabios/bios-pm.bin


guest: 2.6.32-279.el6.x86_64
NetworkManager is running.

host:
- kernel 2.6.32-220.el6 + fix for BZ817243 (which should be unrelated since
migration is not being used).
- qemu-kvm-0.12.1.2-2.295.el6
- seabios-0.6.1.2-19.el6

Please attempt to reproduce it with a recent kernel in the guest.

Comment 24 Qunfang Zhang 2012-08-09 08:34:39 UTC

Hi, Marcelo
Actually, I can no longer reproduce it after 200+ loops on the released rhel6.3 guest.

Comment 25 Marcelo Tosatti 2012-08-09 11:02:27 UTC

Thanks, Qunfang Zhang.

As we are unable to reproduce it (ran another 500 or so loop yesterday), closing as WORKSFORME.

Comment 26 Qunfang Zhang 2012-08-10 03:04:53 UTC

(In reply to comment #25)
> Thanks, Qunfang Zhang.
> 
> As we are unable to reproduce it (ran another 500 or so loop yesterday),
> closing as WORKSFORME.

That's OK, and you're welcome. If I reproduce it later or find a reliable way to trigger it, I will update here.

Comment 27 Qian Guo 2013-06-08 02:50:48 UTC

Reproduced this bug with kernel-2.6.32-358.11.1.el6.x86_64 and qemu-kvm-0.12.1.2-2.355.el6_4.3.x86_64 on an Intel host.

While copying a file from the guest to a host with scp, I did S4 and the guest could not resume. The serial console showed the same log as the original reporter's, so I reopened this bug.


Version-Release number of selected component (if applicable):
host:
uname -r
2.6.32-358.11.1.el6.x86_64
# rpm -qa|grep qemu-kvm
qemu-kvm-0.12.1.2-2.355.el6_4.3.x86_64

guest kernel:
# uname -r
2.6.32-358.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1.Boot RHEL6.5 guest
# /usr/libexec/qemu-kvm -cpu Penryn -m 2048 -smp 2,sockets=1,cores=2,threads=1 -M pc -enable-kvm -name rel6u41 -drive file=/home/rhel6u5.qcow2,if=none,format=qcow2,werror=stop,rerror=stop,cache=none,id=drive-blk -device virtio-blk-pci,drive=drive-blk,id=virtio-disk0 -nodefaults -nodefconfig -monitor stdio -netdev tap,id=netdev0,script=/etc/qemu-ifup -device e1000,netdev=netdev0,id=vnic1,mac=22:61:c0:6b:e8:e7 -vga qxl -spice port=5901,disable-ticketing -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0 -serial unix:/tmp/qiguos1,server,nowait

2.Inside the guest, scp a file (I used a 2G file) to the host and, meanwhile, do S4
# pm-hibernate

3.Try to resume guest.

Actual results:
The guest hangs.


I will attach the log

Comment 28 Qian Guo 2013-06-08 02:52:47 UTC

Created attachment 758402 [details]
dmesg from serial console when guest is in hang state.

Comment 29 Qian Guo 2013-06-08 02:57:04 UTC

*** Bug 968167 has been marked as a duplicate of this bug. ***

Comment 30 Marcelo Tosatti 2013-06-13 01:26:37 UTC

Qian Guo,

Please reproduce with a kdump enabled kernel and with panic_on_oom enabled:

sysctl -w vm.panic_on_oom=1

Then attach the dump file to the BZ.

Thanks
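
Before repeating the test it may help to confirm, inside the guest, that both prerequisites are actually in effect. A hedged sketch, assuming the standard procfs and sysfs locations; dump_prerequisites is a hypothetical helper, and the paths are parameters so it can be pointed elsewhere.

```python
from pathlib import Path


def read_int(path):
    """Read a single integer from a procfs/sysfs style file."""
    return int(Path(path).read_text().strip())


def dump_prerequisites(panic_path="/proc/sys/vm/panic_on_oom",
                       kexec_path="/sys/kernel/kexec_crash_loaded"):
    """Report whether panic_on_oom is set and a crash kernel is loaded."""
    return {
        # vm.panic_on_oom: 0 disables it, any non-zero value enables a panic on OOM
        "panic_on_oom": read_int(panic_path) != 0,
        # kexec_crash_loaded reads 1 once the kdump service has loaded its kernel
        "crash_kernel_loaded": read_int(kexec_path) == 1,
    }
```

If either value is False, an OOM event would not be expected to produce a dump file.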

Comment 31 Qian Guo 2013-06-17 02:43:56 UTC

(In reply to Marcelo Tosatti from comment #30)
> Qian Guo,
> 
> Please reproduce with a kdump enabled kernel and with panic_on_oom enabled:
> 
> sysctl -w vm.panic_on_oom=1
> 
> Then attach the dump file to the BZ.
> 
Ok, I will try to reproduce it and update here.
> Thanks

Comment 32 Qian Guo 2013-06-20 08:24:06 UTC

(In reply to Marcelo Tosatti from comment #30)
> Qian Guo,
> 
> Please reproduce with a kdump enabled kernel and with panic_on_oom enabled:
> 
> sysctl -w vm.panic_on_oom=1
> 
> Then attach the dump file to the BZ.
> 

The guest kernel has kdump and panic_on_oom enabled. The issue reproduced, but I could not get the dump file.
By the way, it cannot be reproduced with the latest rhel6.5 qemu-kvm version, only with qemu-kvm-0.12.1.2-2.355.el6_4.3.x86_64.


> Thanks

Comment 36 Qian Guo 2013-06-25 04:28:43 UTC

Created attachment 764912 [details]
log_of_after_resume_from_S4

Comment 60 Marcelo Tosatti 2014-03-07 02:08:23 UTC

Patch posted: http://patchwork.ozlabs.org/patch/327724/

Comment 61 Marcelo Tosatti 2014-03-12 01:14:14 UTC

(In reply to Marcelo Tosatti from comment #60)
> Patch posted: http://patchwork.ozlabs.org/patch/327724/

Failing patch.  This patch causes the system to lose connectivity after running ethtool diagnostics.  When connectivity is lost, ethtool shows it as having a valid link, but a simple ping fails to connect.  An ethtool -r will successfully bring connectivity back.  Here is an example session:
==================================================================
u1464:[0]/usr/src/kernels/net-next_community> ping u0464-1
PING u0464-1 (190.1.4.64) 56(84) bytes of data.
64 bytes from u0464-1 (190.1.4.64): icmp_seq=1 ttl=64 time=0.378 ms
64 bytes from u0464-1 (190.1.4.64): icmp_seq=2 ttl=64 time=0.165 ms
64 bytes from u0464-1 (190.1.4.64): icmp_seq=3 ttl=64 time=0.095 ms
^C
--- u0464-1 ping statistics ---

Comment 65 John Greene 2014-04-25 19:25:12 UTC

Marcelo, 
Sorry for the delay posting your patch.  I was working on an issue all week in this area and found an upstream patch that may be interesting to you.  I've ported it to 6.3.z for a customer test. It changes the watchdog/reset logic and locking.

Here it is. I would like to review your patch against it and see if we still need both, while the code path is still fresh to me.

in stable: b2f963bfaebadc9117b29f806630ea3bcaec403d
 e1000: fix lockdep warning in e1000_reset_task

As an aside, I know this patch is for upstream, while the BZ is on 6.x.  I assume your fix is needed on 6.x?

Comment 81 Marcelo Tosatti 2014-05-14 10:21:40 UTC

*** Bug 869971 has been marked as a duplicate of this bug. ***

Comment 82 Qunfang Zhang 2014-05-20 08:10:50 UTC

Some update:

I have borrowed 3 types of hosts from beaker, covering 3 e1000e NICs (device_id 0x155a, 0x153a, 0x10d3).

Just now, I tested the bug using the comment 0 steps on the host that has the e1000e NIC with device_id 0x10d3.

Host info:
[root@hp-dl388g8-10 ~]# lspci | grep Ether
03:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
03:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
03:00.2 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
03:00.3 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
04:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet (rev 10)
04:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet (rev 10)
07:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet PCIe
07:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet PCIe
0a:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
.....

There are more than 10 NICs plugged into the host, and 0a:00.0 is the one I want to test.

[root@hp-dl388g8-10 ~]# lspci  -nvv -s 0a:00.0
0a:00.0 0200: 8086:10d3
                   ^^^^^^ (device_id is 10d3)
	Subsystem: 8086:0001
	Physical Slot: 3

# ifconfig 
eth2      Link encap:Ethernet  HWaddr D8:9D:67:13:2E:30  
          inet addr:10.66.86.175  Bcast:10.66.87.255  Mask:255.255.254.0
          inet6 addr: 2620:52:0:4257:da9d:67ff:fe13:2e30/64 Scope:Global
          inet6 addr: fe80::da9d:67ff:fe13:2e30/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2820765 errors:0 dropped:0 overruns:0 frame:0
          TX packets:724302 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:4234636566 (3.9 GiB)  TX bytes:188220334 (179.5 MiB)
          Interrupt:32 

eth8      Link encap:Ethernet  HWaddr 68:05:CA:06:0D:7E  
          inet6 addr: fe80::6a05:caff:fe06:d7e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:5573 errors:0 dropped:0 overruns:0 frame:0
          TX packets:144 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:504293 (492.4 KiB)  TX bytes:20560 (20.0 KiB)
          Interrupt:16 Memory:f3fe0000-f4000000 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

switch    Link encap:Ethernet  HWaddr 68:05:CA:06:0D:7E  
          inet addr:192.168.1.186  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: 2001::48b7:bdff:fedc:373d/64 Scope:Global
          inet6 addr: 2001::6a05:caff:fe06:d7e/64 Scope:Global
          inet6 addr: fe80::6a05:caff:fe06:d7e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1154 errors:0 dropped:0 overruns:0 frame:0
          TX packets:21 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:75119 (73.3 KiB)  TX bytes:3462 (3.3 KiB)

[root@hp-dl388g8-10 ~]# ethtool -i eth8
driver: e1000e
version: 2.3.2-k
firmware-version: 2.1-0
bus-info: 0000:0a:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
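
The match between the lspci slot and the interface under test can be checked programmatically instead of by eye. A minimal sketch, assuming the `lspci -n` and `ethtool -i` output formats shown above; all three function names are hypothetical.

```python
def parse_lspci_ids(line):
    """Parse one `lspci -n` line such as '0a:00.0 0200: 8086:10d3'."""
    slot, pci_class, ids = line.split()[:3]
    vendor, device = ids.split(":")
    return {"slot": slot, "class": pci_class.rstrip(":"),
            "vendor": vendor, "device": device}


def parse_ethtool_info(text):
    """Turn `ethtool -i` output into a dict of key/value strings."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so 'bus-info: 0000:0a:00.0' stays intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def same_device(ethtool_info, lspci_entry):
    """ethtool prints domain:bus:dev.fn, while the lspci output above omits the domain."""
    return ethtool_info.get("bus-info", "").endswith(lspci_entry["slot"])
```

With the output pasted in this comment, the lspci entry carries device id 10d3 and the bus-info of eth8 ends in the same slot, so eth8 is the 82574L port.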


The eth8 interface, which uses the e1000e card with device_id 10d3, has a private IP address. I set up the bridge "switch" on it and made sure the guest uses this bridge to get its network address.

But, unfortunately, I still could not reproduce the bug using the comment 0 steps with the following command line:
 /usr/libexec/qemu-kvm -M rhel6.5.0 -cpu Conroe -enable-kvm -m 2G -smp 2,sockets=1,cores=2,threads=1 -name rhel6.3 -uuid 4c84db67-faf8-4498-9829-19a3d6431d9d -rtc base=localtime,driftfix=slew -drive file=/home/RHEL-Server-6.5-64-virtio.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none,werror=stop,rerror=stop,aio=native -device virtio-blk-pci,bus=pci.0,drive=drive-virtio-disk0,id=virtio-disk0,addr=0x5 -netdev tap,id=hostnet0,vhost=on,script=/etc/qemu-ifup -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:2a:42:10:66,bus=pci.0,addr=0x3 -usb -device usb-tablet,id=input0 -boot c -monitor stdio  -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -vnc :10  -qmp tcp:0:4444,server,nowait -chardev socket,path=/tmp/qzhang-test,server,nowait,id=isa1 -device isa-serial,chardev=isa1,id=isa-serial1 -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x4 -chardev socket,path=/tmp/qga.sock,server,nowait,id=qga0 -device virtserialport,chardev=qga0,name=org.qemu.guest_agent.0,bus=virtio-serial0.0 -device virtio-balloon-pci,bus=pci.0,id=balloon0 -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0

So I plan to try the comment 27 scenario later.

The package version I used:
Host:
kernel-2.6.32-431.el6.x86_64
qemu-kvm-0.12.1.2-2.426.el6.x86_64
Guest:
kernel-2.6.32-431.el6.x86_64

Comment 83 John Greene 2014-05-20 13:53:37 UTC

Thanks for the updates and your efforts.

Also available to you (while brew keeps it; if not, I'll make it available) is the latest RHEL 6.6 build from upstream.  It might be an interesting test for you if you would like to see that as well. There are a few patches that adjust the tx path that might make things work, as this is somewhat of a timing issue.

https://brewweb.devel.redhat.com/taskinfo?taskID=7466674

Comment 87 Ademar Reis 2014-06-05 22:15:31 UTC

S3/S4 support is tech-preview in RHEL6 and it'll be promoted to fully supported
at some point, but only in RHEL7.

Therefore we're closing all S3/S4 related bugs in RHEL6. New bugs will be
considered only if they're regressions or break some important use-case or
certification.

RHEL7 is being more extensively tested and effort from QE is underway in
certifying that this particular bug is not present there.

Please reopen with a justification if you believe this bug should not be
closed. We'll consider them on a case-by-case basis following a best effort
approach.


Thank you.

Comment 88 John Greene 2014-06-06 12:53:20 UTC

I had ported the DMA patch in question to RHEL7 already.  Given comment 87, it would seem more reasonable to chase the issues (DMA memory corruption and ethtool diagnostics causing failure of link (comment 61)).

Any problem dup'ing this to RHEL 7 then?  Seems there are a couple of issues that need to be addressed there.

Comment 89 Marcelo Tosatti 2014-06-06 17:08:06 UTC

(In reply to John Greene from comment #88)
> I had ported the DMA patch in question to RHEL7 already.  Given comment 87,
> it would seem more reasonable to chase the issues (DMA memory corruption and
> ethtool diagnostics causing failure of link (comment 61).
> 
> Any problem dup'ing this to RHEL 7 then?  Seems a couple issues that need to
> be addressed there.

John,

Please do so as we know the problem is still present.