Bug 654937 - kvm freezes after migration cancellation attempt.
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kvm
Version: 5.6
Hardware: x86_64 Linux
Priority: high Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Juan Quintela
QA Contact: Virtualization Bugs
Keywords: Reopened
Depends On:
Blocks: Rhel5KvmTier1
Reported: 2010-11-19 00:09 EST by Mike Cao
Modified: 2013-01-09 18:21 EST (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 669581 (view as bug list)
Environment:
Last Closed: 2011-07-18 06:20:57 EDT


Description Mike Cao 2010-11-19 00:09:07 EST
Description of problem:
After migration starts, use iptables to block the migration TCP port on the dst host,
then quit the qemu-kvm process on the dst host. (qemu) info migrate on the src host should return "migration failed", but it always shows that the migration is in progress.

Version-Release number of selected component (if applicable):
# uname -r
2.6.18-231.el5
# rpm -q kvm
kvm-83-207.el5

How reproducible:
100%

Steps to Reproduce:
1. Start the VM on the src host:
CLI:/usr/libexec/qemu-kvm -m 10G -smp 1 -name RHEL3_64 -uuid 59960563-0abf-79df-fdfb-8462354d62b8 -no-kvm-pit-reinjection -boot c -drive file=/mnt/RHEL3_32.raw,if=ide,format=raw,cache=none,boot=on -net nic,macaddr=04:52:00:35:e8:6a,vlan=0,model=e1000 -net tap,script=/etc/qemu-ifup,vlan=0 -serial pty -parallel none -usb -vnc :4 -monitor stdio 
2. Flush all firewall rules on the dst host:
# iptables -F
3. Start qemu-kvm listening on the migration port on the dst host:
<commandLine> -incoming tcp:0:5888
4. Begin live migration; before it completes, use iptables on the dst host to reject the migration TCP port:
# iptables -A INPUT -p tcp --dport 5888 -j REJECT
5. Kill the qemu-kvm process on the dst host.
  
Actual results:
(qemu)info migrate 
Migration status: active
transferred ram: 904060 kbytes
remaining ram: 9638584 kbytes
total ram: 10506252 kbytes

and the migration never ends.

Expected results:
(qemu)info migrate
Migration failed.



Additional info:
After step 2, check the firewall rules on the dst host:
# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination  

After step 5, check the firewall rules on the dst host:
# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         
REJECT     tcp  --  anywhere             anywhere            tcp dpt:5888 reject-with icmp-port-unreachable 

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
Comment 4 Luiz Capitulino 2011-01-13 08:09:47 EST
Mike, how many minutes did you wait?
Comment 6 Mike Cao 2011-01-13 21:40:24 EST
(In reply to comment #4)
> Mike, how many minutes did you wait?

More than 30mins
Comment 8 Luiz Capitulino 2011-01-14 07:51:46 EST
(In reply to comment #6)
> (In reply to comment #4)
> > Mike, how many minutes did you wait?
> 
> More than 30mins

OK, then it's likely a bug. I mean, if we did get a response then this wouldn't be a bug.
Comment 9 Juan Quintela 2011-01-14 08:48:48 EST
    This is not a kvm bug, it is a libvirt/virt-manager/rhev-m bug.

    This is TCP for you :-(

    The only thing the migration code can do is add a timeout, say 10 minutes by
    default, and "cancel" the migration if it was not able to send any data in that
    time.  Nothing else.  But any management app can do exactly the same: issue
    "info migrate", and if "transferred ram" is stuck for <timeout>, just cancel
    the migration.

    What do you expect migration to do here?

    Later, Juan.
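The management-level timeout described in this comment could be sketched roughly as follows. This is only an illustration: `query_info_migrate` and `cancel_migration` are hypothetical callables standing in for however a management app talks to the monitor, and the 600-second default simply mirrors the "10 minutes" mentioned above. Note that comment 13 in this report shows migrate_cancel itself hanging in this scenario, so a real tool would need a fallback.

```python
import re
import time

# Matches the "transferred ram: N kbytes" line shown by "info migrate".
_TRANSFERRED_RE = re.compile(r"transferred ram:\s*(\d+)\s*kbytes")


def parse_transferred_kbytes(info_migrate_output):
    """Return the transferred-RAM counter in kbytes, or None if absent."""
    match = _TRANSFERRED_RE.search(info_migrate_output)
    return int(match.group(1)) if match else None


class StallDetector:
    """Tracks whether the transferred-RAM counter has stopped advancing."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.last_value = None
        self.last_change = None

    def update(self, transferred_kbytes):
        """Record a sample; return True once the counter is stuck too long."""
        now = self.clock()
        if self.last_change is None or transferred_kbytes != self.last_value:
            # First sample, or progress was made: restart the stall timer.
            self.last_value = transferred_kbytes
            self.last_change = now
            return False
        return now - self.last_change >= self.timeout_seconds


def watch_migration(query_info_migrate, cancel_migration, timeout_seconds=600,
                    poll_seconds=5, sleep=time.sleep, clock=time.monotonic):
    """Poll until migration is no longer active, or cancel it on a stall.

    query_info_migrate and cancel_migration are hypothetical hooks supplied
    by the caller; they are not part of any qemu or libvirt API.
    """
    detector = StallDetector(timeout_seconds, clock)
    while True:
        output = query_info_migrate()
        if "Migration status: active" not in output:
            return "finished"
        if detector.update(parse_transferred_kbytes(output)):
            cancel_migration()
            return "cancelled"
        sleep(poll_seconds)
```

The clock and sleep functions are injectable so the stall logic can be exercised without waiting for real time to pass.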
Comment 10 Luiz Capitulino 2011-01-14 14:21:12 EST
You're right that the real issue should be solved at the management level, ie. hardcoded timeouts should _not_ be added to qemu.

However, I believe that the scenario described in the bug report is expected to eventually fail. Doesn't matter if it's 1 or 40 minutes, I think send()/write() should eventually fail. If this assumption is correct and if this is not happening, then we're likely ignoring or not reporting the error back to the user. That would be a valid bug.

Note that I'm not discussing its severity (ie. does it really matter to report an error after 30 minutes), but it's worth investigating.
Comment 11 Dor Laor 2011-01-16 03:16:38 EST
Closing this bug due to the above comments.
TCP keepalive on the live migration socket would help, but it is not that important. I'll add it to the todo list.
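For illustration, enabling TCP keepalive on a socket looks roughly like this. It is a generic sketch, not qemu code: the helper name `enable_keepalive` and the idle, interval, and probe values are arbitrary assumptions, and the thread does not establish whether keepalive would actually catch the failure in this exact scenario.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, probes=5):
    """Turn on TCP keepalive for sock (values are illustrative assumptions)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tuning options below are platform-specific (present on
    # Linux); skip any that this platform's socket module does not expose.
    tuning = (
        ("TCP_KEEPIDLE", idle),       # seconds of idleness before first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", probes),      # failed probes before the socket errors
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
```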
Comment 12 Juan Quintela 2011-01-17 07:11:46 EST
(In reply to comment #10)
> You're right that the real issue should be solved at the management level, ie.
> hardcoded timeouts should _not_ be added to qemu.
> 
> However, I believe that the scenario described in the bug report is expected to
> eventually fail. Doesn't matter if it's 1 or 40 minutes, I think send()/write()
> should eventually fail. If this assumption is correct and if this is not
> happening, then we're likely ignoring or not reporting the error back to the
> user. That would be a valid bug.
> 
> Note that I'm not discussing its severity (ie. does it really matter to report
> an error after 30 minutes), but it's worth investigating.

This is not how qemu works.

We do a non-blocking write() when the connection is ready to accept packets.
If the connection is blocked after some communication, the host kernel buffers for that socket are full and the socket never becomes ready again. So we never do another write, and we never see the error.

As said, a timeout is needed, and it is as easy to add at the management level as in qemu.

Later, Juan.
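The behaviour described here can be reproduced in miniature. The sketch below is an assumption-laden stand-in, not qemu's migration code: it uses a local socket pair whose peer never reads in place of the blocked TCP connection, fills the kernel send buffer with non-blocking writes, and then checks that the socket no longer reports as writable.

```python
import select
import socket


def fill_send_buffer(sock, chunk=b"x" * 4096):
    """Write without blocking until the kernel refuses more data.

    Returns the number of bytes the kernel accepted.
    """
    sock.setblocking(False)
    total = 0
    while True:
        try:
            total += sock.send(chunk)
        except BlockingIOError:
            # Kernel buffers are full; a writer driven by readiness events
            # now waits for the socket to become writable again.
            return total


def is_writable(sock, wait_seconds=0.1):
    """Report whether select() considers sock ready for writing."""
    _, writable, _ = select.select([], [sock], [], wait_seconds)
    return bool(writable)


def demo():
    """Return (writable before, bytes accepted, writable after filling)."""
    sender, receiver = socket.socketpair()  # receiver never reads
    try:
        before = is_writable(sender)
        sent = fill_send_buffer(sender)
        after = is_writable(sender)
        return before, sent, after
    finally:
        sender.close()
        receiver.close()
```

Once the buffer is full and nothing drains it, no further write() is attempted, so no error is ever returned to the writer; this is why an external timeout is needed.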
Comment 13 Mike Cao 2011-04-29 02:41:59 EDT
Following comment #9, I ran migrate_cancel after the steps in comment #0:
1. (qemu) migrate_cancel

Actual results:
The qemu-kvm process froze.

Based on the above, reopening this issue for further investigation.
Comment 14 juzhang 2011-05-05 23:26:56 EDT
Marking this issue as ack+:
1. It can be reproduced.
2. See comment 13.
3. The same bug is still open in RHEL 6.1 and is proposed to be fixed in RHEL 6.2.
Comment 15 Glauber Costa 2011-06-13 19:03:57 EDT
I am changing the bug description. Based on the last comments, this is an issue of migration freezing after a cancellation attempt. It has nothing to do with timeouts.