Bug 507659 - Migrate command does not end and VM is unresponsive on Nehalem host
Summary: Migrate command does not end and VM is unresponsive on Nehalem host
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kvm
Version: 5.4
Hardware: All
OS: Linux
Priority: high
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Glauber Costa
QA Contact: Lawrence Lim
URL:
Whiteboard:
Depends On:
Blocks: LiveMigration
 
Reported: 2009-06-23 16:42 UTC by lihuang
Modified: 2014-03-26 00:58 UTC
CC List: 10 users

Fixed In Version: kvm-83-93.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-09-02 09:26:31 UTC
Target Upstream Version:
Embargoed:


Attachments
sosreport (2.95 MB, application/x-bzip2)
2009-07-11 11:29 UTC, lihuang
no flags


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2009:1272 0 normal SHIPPED_LIVE New package: kvm 2009-09-01 09:34:32 UTC

Description lihuang 2009-06-23 16:42:55 UTC
Description of problem:
On the Nehalem host, both the destination and source guests sometimes become unresponsive after a migration. So far, the issue cannot be reproduced on other CPUs with the same scenario.


From the src guest, something seems to go wrong with (or after?) the last 10M RAM copy.
After the notification message is printed, I cannot input anything into the monitor, not even Ctrl+C.

CLI: 
root     19373 18512  0 13:33 pts/1    00:00:25 /usr/libexec/qemu-kvm -drive file=/mnt/rhel53-64-ser-virtio.qcow2,if=virtio,cache=off,index=0,boot=on -net nic,macaddr=20:20:20:00:18:59,model=virtio -net tap,script=/etc/qemu-ifup -rtc-td-hack -no-hpet -usbdevice tablet -cpu qemu64,+sse2 -smp 2 -m 4096 -vnc :12 -monitor stdio -boot c -incoming tcp:0:5800

The guest is a freshly installed RHEL 5u3 x86_64 server (default installation).

(qemu) migrate -d tcp:xx.xx.xx.xx:5800
(qemu) migration_notify_done

(qemu) info  migrate 
Migration status: active
transferred ram: 44340 kbytes
remaining ram: 4170456 kbytes
total ram: 4214796 kbytes
(qemu) info  migrate 
Migration status: active
transferred ram: 64312 kbytes
remaining ram: 4150484 kbytes
total ram: 4214796 kbytes
(qemu) info  migrate 
Migration status: active
transferred ram: 4093728 kbytes
remaining ram: 121068 kbytes
total ram: 4214796 kbytes
(qemu) info  migrate 
Migration status: active
transferred ram: 4107992 kbytes
remaining ram: 106804 kbytes
total ram: 4214796 kbytes
(qemu) migration_notify_done



top on src host:
Tasks:   5 total,   0 running,   5 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  12291060k total,  6143348k used,  6147712k free,    17480k buffers
Swap: 10342392k total,    20436k used, 10321956k free,  1728108k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
19373 root      25   0 4357m 4.1g 3900 S  0.0 34.7   0:18.18 qemu-kvm           
19380 root      15   0 4357m 4.1g 3900 S  0.0 34.7   0:00.02 qemu-kvm           
19381 root      15   0 4357m 4.1g 3900 S  0.0 34.7   0:03.39 qemu-kvm           
19382 root      15   0 4357m 4.1g 3900 S  0.0 34.7   0:03.27 qemu-kvm           
19384 root      15   0 4357m 4.1g 3900 S  0.0 34.7   0:00.00 qemu-kvm


strace on src
# strace -p 19373 -p 19380 -p 19381 -p 19382 -p 19384
Process 19373 attached - interrupt to quit
Process 19380 attached - interrupt to quit
Process 19381 attached - interrupt to quit
Process 19382 attached - interrupt to quit
Process 19384 attached - interrupt to quit
[pid 19382] futex(0xb63040, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 19384] rt_sigtimedwait([ALRM IO],  <unfinished ...>
[pid 19381] futex(0xb63040, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 19380] rt_sigtimedwait([USR2],  <unfinished ...>
[pid 19373] write(18, "\1\0\0\0\0\0\0\0", 8




top on dst:
Tasks:   5 total,   0 running,   5 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  12291044k total,  5838552k used,  6452492k free,    99180k buffers
Swap: 10342392k total,        0k used, 10342392k free,  1344168k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
30111 root      15   0 4314m 4.0g 3824 S  0.0 34.5   0:01.44 qemu-kvm           
30118 root      18   0 4314m 4.0g 3824 S  0.0 34.5   0:00.00 qemu-kvm           
30119 root      18   0 4314m 4.0g 3824 S  0.0 34.5   0:00.00 qemu-kvm           
30120 root      15   0 4314m 4.0g 3824 S  0.0 34.5   0:00.00 qemu-kvm           
30122 root      18   0 4314m 4.0g 3824 S  0.0 34.5   0:00.00 qemu-kvm           

strace on dst:

# strace -p 30111 -p 30118 -p 30119 -p 30120 -p 30122
Process 30111 attached - interrupt to quit
Process 30118 attached - interrupt to quit
Process 30119 attached - interrupt to quit
Process 30120 attached - interrupt to quit
Process 30122 attached - interrupt to quit
[pid 30120] futex(0xb63040, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30122] rt_sigtimedwait([ALRM IO],  <unfinished ...>
[pid 30119] futex(0xb63040, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30118] rt_sigtimedwait([USR2],  <unfinished ...>
[pid 30111] recvfrom(21, 


Version-Release number of selected component (if applicable):
[root@dhcp-66-70-6 ~]# cat /etc/redhat-release 
Red Hat Enterprise Virtualization Hypervisor release 5.4-2.0.99 (7.1)
[root@dhcp-66-70-6 ~]# rpm -q kvm
kvm-83-77.el5
[root@dhcp-66-70-6 ~]# rpm -q kernel
kernel-2.6.18-154.el5


How reproducible:
2/10

Steps to Reproduce:
1. Install a fresh RHEL 5u3 x86_64 server guest.
2. Run the guest on the src host and start listen mode on the dst host.
3. Perform the migration (see the sketch below).
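
A minimal sketch of the setup, with <dst-ip> as a placeholder (the flags mirror the CLI in the description):

on the dst host, listen mode:
/usr/libexec/qemu-kvm -drive file=/mnt/rhel53-64-ser-virtio.qcow2,if=virtio,cache=off,index=0,boot=on -net nic,macaddr=20:20:20:00:18:59,model=virtio -net tap,script=/etc/qemu-ifup -rtc-td-hack -no-hpet -usbdevice tablet -cpu qemu64,+sse2 -smp 2 -m 4096 -vnc :12 -monitor stdio -boot c -incoming tcp:0:5800

on the src host, the same command without -incoming, then in the monitor:
(qemu) migrate -d tcp:<dst-ip>:5800
(qemu) info migrate     <- poll until "Migration status: completed"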
  
Actual results:


Expected results:


Additional info:
Host information 
CPU:
processor	: 7
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Core(TM) i7 CPU         920  @ 2.67GHz
stepping	: 4
cpu MHz		: 1600.000
cache size	: 8192 KB
physical id	: 0
siblings	: 8
core id		: 3
cpu cores	: 4
apicid		: 7
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr popcnt lahf_lm
bogomips	: 5319.97
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management: [8]

# free -m
             total       used       free     shared    buffers     cached
Mem:         12002       5701       6301          0         96       1312
-/+ buffers/cache:       4291       7711
Swap:        10099          0      10099

Comment 1 Yaniv Kaul 2009-06-23 16:47:14 UTC
Not that I see how it may be related, but perhaps it's worth testing with EPT and then without VPID?

Comment 2 lihuang 2009-06-23 17:10:55 UTC
Can reproduce
(it happened on the second attempt).

# rmmod kvm_intel
# modprobe kvm_intel enable_ept=1 enable_vpid=0
# dmesg 
kvm: virtualization flags detected on this hardware: vmx tpr_shadow vnmi flexpriority ept vpid
loaded kvm module (kvm-83-77.el5)

^^ Please correct me if I am wrong.

Comment 3 Dor Laor 2009-06-25 21:30:12 UTC
Does the strace output advance or stay stuck? Also get 'pstack PID' output.
Anything in dmesg?
Can you reduce the guest memory to 2G and recheck?
What happens if you disable ept completely?

Comment 4 lihuang 2009-06-26 08:31:55 UTC
(In reply to comment #3)
> Does the strace output advance or stay stuck? Also get 'pstack PID' output.
I let the migration run for more than 8 hours; that is the whole strace output.

> Anything in dmesg?
Only logs like *cpu0 unimplemented perfctr wrmsr:*
 
> Can you reduce the guest memory to 2G and recheck?
> What happens if you disable ept completely? 

I am retesting, and I disabled ept with:
# rmmod kvm_intel
# modprobe kvm_intel enable_ept=0

Is that enough to *disable ept completely*?
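
One hedged way to confirm the parameter took effect (assuming this kvm_intel build exposes its parameters read-only under /sys, as module_param normally does):

# cat /sys/module/kvm_intel/parameters/enable_ept
(expect 0 or N when ept is disabled)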

Comment 5 lihuang 2009-06-26 09:12:53 UTC
With 2G RAM: reproduced.

Note: ept was still enabled for this run. I am running the test without ept now.

[root@dhcp-66-70-6 ~]# pstack 2156                                                                                                                                                  
Thread 5 (Thread 0x4149c940 (LWP 2163)):                                                                                                                                            
#0  0x00002af01b78f744 in do_sigwaitinfo () from /lib64/libc.so.6                                                                                                                   
#1  0x00002af01b78f7fd in sigwaitinfo () from /lib64/libc.so.6                                                                                                                      
#2  0x0000000000419ff1 in fprintf ()                                                                                                                                                
#3  0x00002af01a5a14a7 in start_thread () from /lib64/libpthread.so.0                                                                                                               
#4  0x00002af01b831b7d in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x41ece940 (LWP 2164)):
#0  0x00002af01a5a8174 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00002af01a5a3aca in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x00002af01a5a398c in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000516544 in fprintf ()
#4  0x0000000000516b64 in fprintf ()
#5  0x00002af01a5a14a7 in start_thread () from /lib64/libpthread.so.0
#6  0x00002af01b831b7d in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x428cf940 (LWP 2165)):
#0  0x00002af01a5a8174 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00002af01a5a3aca in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x00002af01a5a398c in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000516544 in fprintf ()
#4  0x0000000000516b64 in fprintf ()
#5  0x00002af01a5a14a7 in start_thread () from /lib64/libpthread.so.0
#6  0x00002af01b831b7d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x43cd1940 (LWP 2167)):
#0  0x00002af01b78f744 in do_sigwaitinfo () from /lib64/libc.so.6
#1  0x00002af01b78f7fd in sigwaitinfo () from /lib64/libc.so.6
#2  0x0000000000419ff1 in fprintf ()
#3  0x00002af01a5a14a7 in start_thread () from /lib64/libpthread.so.0
#4  0x00002af01b831b7d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2af01f4be080 (LWP 2156)):
#0  0x00002af01a5a854b in write () from /lib64/libpthread.so.0
#1  0x00000000005163f3 in fprintf ()
#2  0x0000000000409cb3 in fprintf ()
#3  0x000000000047f895 in fprintf ()
#4  0x00000000004b3276 in fprintf ()
#5  0x00000000004b3336 in fprintf ()
#6  0x0000000000486587 in fprintf ()
#7  0x000000000047fa23 in fprintf ()
#8  0x00000000004808a9 in fprintf ()
#9  0x000000000040712c in fprintf ()
#10 0x00000000004099b7 in fprintf ()
#11 0x00000000005167fa in fprintf ()
#12 0x000000000040e415 in fprintf ()
#13 0x00002af01b77b994 in __libc_start_main () from /lib64/libc.so.6
#14 0x0000000000406be9 in fprintf ()
#15 0x00007fff0f319ff8 in ?? ()
#16 0x0000000000000000 in ?? ()
[root@dhcp-66-70-6 ~]#

Comment 6 lihuang 2009-06-26 12:44:01 UTC
> > What happens if you disable ept completely? 
> 
> I am retesting, and I disabled ept with:
> # rmmod kvm_intel
> # modprobe kvm_intel enable_ept=0
> 
> Is that enough to *disable ept completely*?

Can reproduce when ept is disabled.
How reproducible:
1/9 (the ninth attempt failed)


On another host:
processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Core(TM)2 Quad CPU    Q9400  @ 2.66GHz

Ran the same steps 20 times.
==> Cannot reproduce.

Comment 9 lihuang 2009-07-03 14:43:18 UTC
Please note that the test result in comment #7 is not on an i7 host, though it is more difficult to reproduce there.
We can also reproduce the issue on Red Hat Enterprise Virtualization Hypervisor release 5.4-2.0.99 (9.1).

Comment 10 Dor Laor 2009-07-05 16:25:20 UTC
(In reply to comment #5)

> [root@dhcp-66-70-6 ~]# pstack 2156                                              

..

> Thread 1 (Thread 0x2af01f4be080 (LWP 2156)):
> #0  0x00002af01a5a854b in write () from /lib64/libpthread.so.0
> #1  0x00000000005163f3 in fprintf ()
> #2  0x0000000000409cb3 in fprintf ()
> #3  0x000000000047f895 in fprintf ()

Can you install the kvm-debuginfo package so we can see the symbols?


> #4  0x00000000004b3276 in fprintf ()
> #5  0x00000000004b3336 in fprintf ()
> #6  0x0000000000486587 in fprintf ()
> #7  0x000000000047fa23 in fprintf ()
> #8  0x00000000004808a9 in fprintf ()
> #9  0x000000000040712c in fprintf ()

Comment 11 lihuang 2009-07-06 03:44:05 UTC
(In reply to comment #10)
> (In reply to comment #5)
> 
> > [root@dhcp-66-70-6 ~]# pstack 2156                                              
> 
> ..
> 
> > Thread 1 (Thread 0x2af01f4be080 (LWP 2156)):
> > #0  0x00002af01a5a854b in write () from /lib64/libpthread.so.0
> > #1  0x00000000005163f3 in fprintf ()
> > #2  0x0000000000409cb3 in fprintf ()
> > #3  0x000000000047f895 in fprintf ()
> 
> Can you install the kvm-debuginfo package so we can see the symbols?
> 
> 
> > #4  0x00000000004b3276 in fprintf ()
> > #5  0x00000000004b3336 in fprintf ()
> > #6  0x0000000000486587 in fprintf ()
> > #7  0x000000000047fa23 in fprintf ()
> > #8  0x00000000004808a9 in fprintf ()
> > #9  0x000000000040712c in fprintf ()  

kvm-debuginfo is installed by default.

[root@dhcp-66-70-31 /]# pstack  29034
Thread 5 (Thread 0x41af1940 (LWP 29041)):
#0  0x00002b7fb1fed744 in do_sigwaitinfo () from /lib64/libc.so.6
#1  0x00002b7fb1fed7fd in sigwaitinfo () from /lib64/libc.so.6
#2  0x000000000041a3f1 in fprintf ()
#3  0x00002b7fb0dff4a7 in start_thread () from /lib64/libpthread.so.0
#4  0x00002b7fb208fb7d in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x424f2940 (LWP 29042)):
#0  0x00002b7fb0e06174 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00002b7fb0e01aca in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x00002b7fb0e0198c in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000517064 in fprintf ()
#4  0x0000000000517684 in fprintf ()
#5  0x00002b7fb0dff4a7 in start_thread () from /lib64/libpthread.so.0
#6  0x00002b7fb208fb7d in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x42ef3940 (LWP 29043)):
#0  0x00002b7fb0e06174 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00002b7fb0e01aca in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x00002b7fb0e0198c in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000517064 in fprintf ()
#4  0x0000000000517684 in fprintf ()
#5  0x00002b7fb0dff4a7 in start_thread () from /lib64/libpthread.so.0
#6  0x00002b7fb208fb7d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x442f5940 (LWP 29045)):
#0  0x00002b7fb1fed744 in do_sigwaitinfo () from /lib64/libc.so.6
#1  0x00002b7fb1fed7fd in sigwaitinfo () from /lib64/libc.so.6
#2  0x000000000041a3f1 in fprintf ()
#3  0x00002b7fb0dff4a7 in start_thread () from /lib64/libpthread.so.0
#4  0x00002b7fb208fb7d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2b7fb5d1c080 (LWP 29034)):
#0  0x00002b7fb0e0654b in write () from /lib64/libpthread.so.0
#1  0x0000000000516f13 in fprintf ()
#2  0x0000000000409cb3 in fprintf ()
#3  0x000000000047fce5 in fprintf ()
#4  0x00000000004b3d96 in fprintf ()
#5  0x00000000004b3e56 in fprintf ()
#6  0x00000000004869d7 in fprintf ()
#7  0x000000000047fe73 in fprintf ()
#8  0x0000000000480cf9 in fprintf ()
#9  0x000000000040712c in fprintf ()
#10 0x00000000004099b7 in fprintf ()
#11 0x000000000051731a in fprintf ()
#12 0x000000000040e8d5 in fprintf ()
#13 0x00002b7fb1fd9994 in __libc_start_main () from /lib64/libc.so.6
#14 0x0000000000406be9 in fprintf ()
#15 0x00007fff38aa7fe8 in ?? ()
#16 0x0000000000000000 in ?? ()
[root@dhcp-66-70-31 /]# rpm -qa | grep kvm
etherboot-zroms-kvm-5.4.4-10.el5
kvm-debuginfo-83-83.el5
kvm-qemu-img-83-83.el5
kmod-kvm-83-83.el5
kvm-tools-83-83.el5
kvm-83-83.el5

Comment 12 Dor Laor 2009-07-06 11:09:53 UTC
For some reason the stack trace is not complete.
Anyway, from the log I can see that the source blocks on a socket write and the destination on a read. It's weird; I'll ask them for additional info.

Can you keep a tcpdump trace of the migration, especially of the last phase? (It should be quite big.)
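
A minimal capture sketch, assuming eth0 is the migration-facing interface and port 5800 from the CLI above; run it on both hosts with different file names:

# tcpdump -i eth0 -s 0 -w migration-src.pcap tcp port 5800

The -s 0 keeps full packet payloads, and the port filter limits the file to just the migration stream.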

Comment 13 Brian Stein 2009-07-07 14:22:06 UTC
Please provide the information requested in Comment #12

Comment 14 lihuang 2009-07-07 17:24:17 UTC
The tcpdump traces will be available in 40 minutes at:
http://focus.bne.redhat.com/~lihuang/tcpdump/
-> src: 32.pcap
   dst: 6.pcap


(qemu) info migrate 
Migration status: active
transferred ram: 3129992 kbytes
remaining ram: 1193684 kbytes
total ram: 4214796 kbytes
(qemu) info migrate 
Migration status: active
transferred ram: 3136840 kbytes
remaining ram: 1194916 kbytes
total ram: 4214796 kbytes
(qemu) info migrate 
Migration status: active
transferred ram: 3142420 kbytes
remaining ram: 1189772 kbytes
total ram: 4214796 kbytes
(qemu) migration_notify_done

Comment 15 Dor Laor 2009-07-08 10:10:06 UTC
I checked the source-side capture. It looks like the connection stops due to a spanning tree recalculation:

20:09:44.585092 IP dhcp-66-70-31.nay.redhat.com.39232 > dhcp-66-70-6.nay.redhat.com.5800: . 329592089:329593537(1448) ack 1 win 46 <nop,nop,timestamp 3279940 3314180>
20:09:44.585278 IP dhcp-66-70-31.nay.redhat.com.39232 > dhcp-66-70-6.nay.redhat.com.5800: P 329593537:329594985(1448) ack 1 win 46 <nop,nop,timestamp 3279940 3314180>
20:09:44.585321 IP dhcp-66-70-6.nay.redhat.com.5800 > dhcp-66-70-31.nay.redhat.com.39232: . ack 329594985 win 24576 <nop,nop,timestamp 3314208 3279940>


Here the communication stopped, maybe because a tap device was added/published on the destination. It seems the spanning tree triggers it; you can see it is active below.



20:09:45.113629 arp who-has dhcp-66-70-31.nay.redhat.com tell dhcp-66-70-6.nay.redhat.com
20:09:45.113763 arp reply dhcp-66-70-31.nay.redhat.com is-at 00:21:9b:ff:b5:76 (oui Unknown)
20:09:45.206610 arp who-has 10.66.70.195 tell dhcp-66-70-6.nay.redhat.com
20:09:45.206756 arp reply 10.66.70.195 is-at 00:21:9b:58:5f:d3 (oui Unknown)


The TCP connection is probably paused; maybe there is no data to send.


20:09:46.368998 STP 802.1w, Rapid STP, Flags [Learn, Forward], bridge-id 8098.00:1d:45:20:d5:c0.80a2, length 43
20:09:48.378657 STP 802.1w, Rapid STP, Flags [Learn, Forward], bridge-id 8098.00:1d:45:20:d5:c0.80a2, length 43
20:09:50.388659 STP 802.1w, Rapid STP, Flags [Learn, Forward], bridge-id 8098.00:1d:45:20:d5:c0.80a2, length 43
20:09:50.899022 IP 10.66.70.195.60169 > dhcp-66-70-6.nay.redhat.com.ssh: P 1027881100:1027881148(48) ack 346347990 win 136 <nop,nop,timestamp 74515412 2556650>
20:09:50.904599 IP dhcp-66-70-6.nay.redhat.com.ssh > 10.66.70.195.60169: P 1:49(48) ack 48 win 90 <nop,nop,timestamp 3320528 74515412>
20:09:50.904711 IP 10.66.70.195.60169 > dhcp-66-70-6.nay.redhat.com.ssh: . ack 49 win 136 <nop,nop,timestamp 74515414 3320528>
20:09:51.074070 IP 10.66.70.195.60169 > dhcp-66-70-6.nay.redhat.com.ssh: P 48:96(48) ack 49 win 136 <nop,nop,timestamp 74515456 3320528>



Please re-try with spanning tree off on the bridge.
Can you post the output of the brctl show command?



This is also alarming:
20:09:44.585092 IP dhcp-66-70-31.nay.redhat.com.39232 > dhcp-66-70-6.nay.redhat.com.5800: . 329592089:329593537(1448) ack 1 win 46 <nop,nop,timestamp 3279940 3314180>
20:09:44.585278 IP dhcp-66-70-31.nay.redhat.com.39232 > dhcp-66-70-6.nay.redhat.com.5800: P 329593537:329594985(1448) ack 1 win 46 <nop,nop,timestamp 3279940 3314180>
20:09:44.585321 IP dhcp-66-70-6.nay.redhat.com.5800 > dhcp-66-70-31.nay.redhat.com.39232: . ack 329594985 win 24576 <nop,nop,timestamp 3314208 3279940>
20:09:45.113629 arp who-has dhcp-66-70-31.nay.redhat.com tell dhcp-66-70-6.nay.redhat.com
20:09:45.113763 arp reply dhcp-66-70-31.nay.redhat.com is-at 00:21:9b:ff:b5:76 (oui Unknown)
20:09:45.206610 arp who-has 10.66.70.195 tell dhcp-66-70-6.nay.redhat.com
20:09:45.206756 arp reply 10.66.70.195 is-at 00:21:9b:58:5f:d3 (oui Unknown)
20:09:46.368998 STP 802.1w, Rapid STP, Flags [Learn, Forward], bridge-id 8098.00:1d:45:20:d5:c0.80a2, length 43
20:09:48.378657 STP 802.1w, Rapid STP, Flags [Learn, Forward], bridge-id 8098.00:1d:45:20:d5:c0.80a2, length 43
20:09:50.388659 STP 802.1w, Rapid STP, Flags [Learn, Forward], bridge-id 8098.00:1d:45:20:d5:c0.80a2, length 43
20:09:50.899022 IP 10.66.70.195.60169 > dhcp-66-70-6.nay.redhat.com.ssh: P 1027881100:1027881148(48) ack 346347990 win 136 <nop,nop,timestamp 74515412 2556650>
20:09:50.904599 IP dhcp-66-70-6.nay.redhat.com.ssh > 10.66.70.195.60169: P 1:49(48) ack 48 win 90 <nop,nop,timestamp 3320528 74515412>
20:09:50.904711 IP 10.66.70.195.60169 > dhcp-66-70-6.nay.redhat.com.ssh: . ack 49 win 136 <nop,nop,timestamp 74515414 3320528>
20:09:51.074070 IP 10.66.70.195.60169 > dhcp-66-70-6.nay.redhat.com.ssh: P 48:96(48) ack 49 win 136 <nop,nop,timestamp 74515456 3320528>

Comment 16 lihuang 2009-07-08 11:21:13 UTC
> 
> Please re-try with spanning tree off on the bridge.
> Can you post the output of the brctl show command?
> 

Yes; STP is not enabled by default. Retrying...

Comment 17 Dor Laor 2009-07-08 11:32:45 UTC
(In reply to comment #16)
> > 
> > Please re-try with spanning tree off on the bridge.
> > Can you post the output of the brctl show command?
> > 
> 
> Yes; STP is not enabled by default. Retrying...

It is good that STP is not enabled.
The thing is, why do we get a networking outage? There are STP recalculations, and the connection between the source and destination is lost (that's why you see the arp requests).

Please check how this host differs from the others.

Comment 18 lihuang 2009-07-08 12:35:22 UTC
After I set stp to on/yes, the host lost its connection...
I am going to test within a private network.
(BTW, if enabling STP is a workaround for this issue, why isn't it set as the default?)

Comment 19 Dor Laor 2009-07-09 09:24:12 UTC
(In reply to comment #18)
> After I set stp to on/yes, the host lost its connection...
> I am going to test within a private network.
> (BTW, if enabling STP is a workaround for this issue, why isn't it set as the
> default?)


It is not a workaround; STP should be disabled (the default).
You might have some setup issues.

Comment 20 lihuang 2009-07-11 11:29:10 UTC
Created attachment 351336 [details]
sosreport

Can reproduce with kvm-83-87.el5,
and I haven't found how this host differs from the others.

Attached is the sosreport; please check.

Comment 21 Dor Laor 2009-07-12 08:51:03 UTC
From the above I can see that you do have spanning tree running:

"Jul 11 11:08:30 dhcp-66-70-6 kernel: breth0: topology change detected, propagating
Jul 11 11:08:30 dhcp-66-70-6 kernel: breth0: port 3(tap0) entering forwarding state
Jul 11 11:12:40 dhcp-66-70-6 kernel: breth0: port 3(tap0) entering disabled state
Jul 11 11:12:40 dhcp-66-70-6 kernel: device tap0 left promiscuous mode
Jul 11 11:12:40 dhcp-66-70-6 kernel: breth0: port 3(tap0) entering disabled state
Jul 11 11:13:01 dhcp-66-70-6 kernel: device tap0 entered promiscuous mode
Jul 11 11:13:01 dhcp-66-70-6 kernel: breth0: topology change detected, propagating
Jul 11 11:13:01 dhcp-66-70-6 kernel: breth0: port 3(tap0) entering forwarding state
"

Please run 'brctl stp breth0 off' and 'brctl setfd breth0 0', then retest.
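
For reference, a minimal sketch of those bridge changes, assuming breth0 is the bridge named in the log above:

# brctl show
# brctl showstp breth0
# brctl stp breth0 off
# brctl setfd breth0 0

brctl show confirms the bridge and its tap ports, showstp shows the current STP and forward-delay state, and a forward delay of 0 lets a newly added tap port start forwarding immediately instead of sitting in the listening/learning states.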

Also, please add the kvm sosreport plugin next time.

Comment 25 Dor Laor 2009-07-14 21:32:58 UTC
Can you try using this rpm (private build):
https://brewweb.devel.redhat.com/taskinfo?taskID=1888898

Comment 26 lihuang 2009-07-15 09:05:18 UTC
Hi Dor 

Tested 10 times; have not reproduced the original issue so far,
but got a qemu monitor error:

remaining ram: 105200 kbytes
total ram: 4214796 kbytes
(qemu) info migrate 
Migration status: active
transferred ram: 4117176 kbytes
remaining ram: 97824 kbytes
total ram: 4214796 kbytes
(qemu) migration_notify_done
failed to notify io thread
failed to notify io thread
failed to notify io thread
[the "failed to notify io thread" line repeated 37 times in total]
info migrate 
Migration status: completed
(qemu)

Comment 27 Dor Laor 2009-07-15 09:26:30 UTC
(In reply to comment #26)
> Hi Dor 
> 
> Tested 10 times; have not reproduced the original issue so far,
> but got a qemu monitor error:
> 
> remaining ram: 105200 kbytes
> total ram: 4214796 kbytes
> (qemu) info migrate 
> Migration status: active
> transferred ram: 4117176 kbytes
> remaining ram: 97824 kbytes
> total ram: 4214796 kbytes
> (qemu) migration_notify_done
> failed to notify io thread
> failed to notify io thread

Superb!
This message shows exactly that the solution is good.
I'll remove the message from the code.

Comment 35 Suqin Huang 2009-07-23 13:14:52 UTC
Tested on kvm-83-94.el5.

Commands used:
Source:
/usr/libexec/qemu-kvm -drive file=/mnt/RHEL-4.8-64.qcow2 -net nic,macaddr=20:20:20:00:18:59,model=virtio -net tap,script=/etc/qemu-ifup -rtc-td-hack -no-hpet -usbdevice tablet -cpu qemu64,+sse2 -smp 2 -m 4096 -vnc :12 -monitor stdio -boot c

Destination:
/usr/libexec/qemu-kvm -drive file=/mnt/RHEL-4.8-64.qcow2 -net nic,macaddr=20:20:20:00:18:59,model=virtio -net tap,script=/etc/qemu-ifup -rtc-td-hack -no-hpet -usbdevice tablet -cpu qemu64,+sse2 -smp 2 -m 4096 -vnc :12 -monitor stdio -boot c -incoming tcp:0:6000

Tried several times with the following situations during migration:
1. Run stress testing during migration on rhel5.4-32:
stress -i 4 -m 4 -c 4 -d 4

2. Download a file from a website during migration on rhel4.8-32.

3. Play Flash during migration on rhel5.4-64.

4. Reboot the vm during migration on rhel4.8-64.
5. Migrate rhel3.9-32 as soon as the vm is started.


Cannot reproduce this issue in any of these situations.
Changing the status to *VERIFIED*; please reopen it if you can reproduce.

Comment 37 errata-xmlrpc 2009-09-02 09:26:31 UTC
An advisory has been issued which should help resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1272.html

