Bug 589017 - [rhel5.5] [kvm] dead lock in qemu during off-line migration
Summary: [rhel5.5] [kvm] dead lock in qemu during off-line migration
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kvm
Version: 5.7
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Assignee: Juan Quintela
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: Rhel5KvmTier1
 
Reported: 2010-05-05 06:34 UTC by Haim
Modified: 2014-01-13 00:45 UTC
CC List: 12 users

Fixed In Version: kvm-83-213.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-01-13 23:35:28 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2011:0028 0 normal SHIPPED_LIVE Low: kvm security and bug fix update 2011-01-13 11:03:39 UTC

Description Haim 2010-05-05 06:34:21 UTC
Description of problem:

qemu seems to enter some kind of deadlock whenever I perform off-line migration (suspend, save to disk) on a specific host.
When I try to connect to vmid.monitor.socket via nc, it doesn't respond (the qemu monitor prompt doesn't show).
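
For reference, here is a minimal C probe of the monitor socket, roughly what the nc check above does (the socket path in the sketch is a placeholder, not the actual path on the host): a responsive monitor sends its "(qemu)" prompt shortly after connect, while a wedged one stays silent and the select() below times out.

/* monitor_probe.c - connect to a qemu monitor UNIX socket and wait for
 * the prompt. The path below is a placeholder, not the real socket on
 * the problematic host. Build with: gcc -o monitor_probe monitor_probe.c */
#include <stdio.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/var/run/vmid.monitor.socket";   /* placeholder */
    struct sockaddr_un addr;
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }

    /* Give the monitor 5 seconds to produce its banner/prompt. */
    fd_set rfds;
    struct timeval tv = { .tv_sec = 5, .tv_usec = 0 };
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    if (select(fd + 1, &rfds, NULL, NULL, &tv) <= 0) {
        fprintf(stderr, "monitor did not respond within 5s\n");
        return 1;
    }

    char buf[256];
    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    if (n > 0) {
        buf[n] = '\0';
        printf("monitor says: %s\n", buf);
    }
    close(fd);
    return 0;
}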

Attaching gdb to the qemu-kvm binary and the problematic process id shows the following lock:

(gdb) info threads
  4 Thread 0x4270c940 (LWP 31014)  0x0000003834631744 in do_sigwaitinfo () from /lib64/libc.so.6
  3 Thread 0x4310d940 (LWP 31053)  0x0000003834e0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
  2 Thread 0x41907940 (LWP 31056)  0x0000003834631744 in do_sigwaitinfo () from /lib64/libc.so.6
* 1 Thread 0x2ba1f6b0cfa0 (LWP 31013)  0x0000003834e0d89b in write () from /lib64/libpthread.so.0
(gdb)

thread 3: 

(gdb) where
#0  0x0000003834e0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003834e08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x0000003834e08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00000000004fef44 in kvm_main_loop_wait (env=0x643eb90, timeout=<value optimized out>)
    at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/qemu-kvm.c:257
#4  0x00000000004ff5a4 in kvm_main_loop_cpu (_env=<value optimized out>)
    at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/qemu-kvm.c:392
#5  ap_main_loop (_env=<value optimized out>) at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/qemu-kvm.c:443
#6  0x0000003834e0673d in start_thread () from /lib64/libpthread.so.0
#7  0x00000038346d3d1d in clone () from /lib64/libc.so.6


thread 4:

(gdb) where
#0  0x0000003834631744 in do_sigwaitinfo () from /lib64/libc.so.6
#1  0x00000038346317fd in sigwaitinfo () from /lib64/libc.so.6
#2  0x000000000041a3c1 in sigwait_compat (opaque=<value optimized out>) at compatfd.c:38
#3  0x0000003834e0673d in start_thread () from /lib64/libpthread.so.0
#4  0x00000038346d3d1d in clone () from /lib64/libc.so.6

I am sure more can be revealed by looking at the problematic host, so please ask for more information.

Reproduction: 100% on a specific host running kvm-83-164.el5



Steps to Reproduce:
1. Suspend the VM (save to disk, off-line migration).
2. qemu stops responding.

  


Additional info:

Comment 1 Haim 2010-05-05 06:37:20 UTC
Please note that the same operation succeeds on other hosts running the same version of kvm and kernel 2.6.18-194.

Also, the operation was performed from rhev-m --> vdsm --> kvm.

Comment 2 Gleb Natapov 2010-05-05 09:36:53 UTC
backtrace of thread 1:
#0  0x0000003834e0d89b in write () from /lib64/libpthread.so.0
#1  0x0000000000473bdc in file_write (s=<value optimized out>, buf=0x1d2f9058, 
    size=20480) at migration-exec.c:42
#2  0x000000000046afaf in migrate_fd_put_buffer (opaque=0x1d04dd40, 
    data=0x1d2f9058, size=20480) at migration.c:211
#3  0x000000000049bd2d in buffered_put_buffer (opaque=0x1d2d8d40, 
    buf=0x1d2f6058 "\275 \377\377\377\213\301\301\351\002\363\245\213È\341\003\363\244\203M\374\377\213\205 \377\377\377\211\205\020\377\377\377\213\205\024\377\377\377\203\300\003\203\340\374\001\205 \377\377\377\213\205 \377\377\377\211\205", pos=<value optimized out>, size=32768) at buffered_file.c:134
#4  0x0000000000471b38 in qemu_fflush (f=0x1d2f6010) at savevm.c:419
#5  0x0000000000472e95 in qemu_put_buffer (f=0x1d2f6010, 
    buf=0x2b1d8c5fc33a "\003~\034\003^ \213E\374\353\337_^[\311\302\004", 
    size=3270) at savevm.c:482
#6  0x0000000000408af8 in ram_save_block (f=0x1d2f6010)
    at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/vl.c:3358
#7  0x0000000000408b6c in ram_save_live (f=0x1d2f6010, stage=2, 
    opaque=<value optimized out>)
    at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/vl.c:3427
#8  0x0000000000472cea in qemu_savevm_state_iterate (f=0x1d2f6010)
    at savevm.c:768
#9  0x000000000046b09c in migrate_fd_put_ready (opaque=<value optimized out>)
    at migration.c:256
#10 0x00000000004071bc in qemu_run_timers (ptimer_head=0xb38e00, 
    current_time=158911986)
    at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/vl.c:1271
#11 0x0000000000409577 in main_loop_wait (timeout=<value optimized out>)
    at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/vl.c:4021
#12 0x00000000004ff1ea in kvm_main_loop ()
    at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/qemu-kvm.c:596
#13 0x000000000040e425 in main_loop (argc=43, argv=0x7fff8cc0c588, 
    envp=<value optimized out>)
    at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/vl.c:4040
#14 main (argc=43, argv=0x7fff8cc0c588, envp=<value optimized out>)
    at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/vl.c:6476

Comment 3 Gleb Natapov 2010-05-05 09:39:04 UTC
The process is not really stuck. It seems that qemu_run_timers() always calls migrate_fd_put_ready(), so nothing else has a chance to run.
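
For illustration, here is a toy C model of that pattern (not qemu-kvm code; global_mutex, migration_timer_cb(), and the unread pipe are made up for the sketch): a timer callback that immediately re-arms itself and eventually blocks in write() while the global mutex is held, leaving the vcpu thread parked in pthread_mutex_lock() exactly like thread 3 in the gdb output above.

/* starvation.c - toy model of the pattern in the backtraces above; the
 * names (global_mutex, migration_timer_cb, the unread pipe) are invented
 * and this is not qemu-kvm code.
 * Build with: gcc -pthread -o starvation starvation.c */
#include <pthread.h>
#include <string.h>
#include <unistd.h>

static pthread_mutex_t global_mutex = PTHREAD_MUTEX_INITIALIZER;
static int out_fd;               /* write end of a pipe nobody reads */
static int timer_pending = 1;    /* "migration timer" armed for right now */

/* Stands in for migrate_fd_put_ready(): push one chunk of guest RAM and
 * immediately re-arm the timer, so the next main-loop pass runs it again. */
static void migration_timer_cb(void)
{
    char chunk[32768];
    memset(chunk, 0, sizeof(chunk));
    ssize_t n = write(out_fd, chunk, sizeof(chunk)); /* blocks once the pipe is full */
    (void)n;
    timer_pending = 1;
}

/* Stands in for the main/kvm_main_loop() thread: it holds the global
 * mutex while running timers, so once the write blocks, the mutex stays
 * held - this is thread 1 in the gdb output. */
static void *io_thread(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&global_mutex);
        if (timer_pending) {
            timer_pending = 0;
            migration_timer_cb();     /* monitor fds never get serviced */
        }
        pthread_mutex_unlock(&global_mutex);
    }
    return NULL;
}

/* Stands in for ap_main_loop()/kvm_main_loop_cpu(): the vcpu thread ends
 * up waiting in pthread_mutex_lock(), like thread 3 in the gdb output. */
static void *vcpu_thread(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&global_mutex);
        /* ...run guest code, handle exits... */
        pthread_mutex_unlock(&global_mutex);
    }
    return NULL;
}

int main(void)
{
    int fds[2];
    if (pipe(fds) < 0)
        return 1;
    out_fd = fds[1];              /* fds[0] is never read */

    pthread_t io, vcpu;
    pthread_create(&vcpu, NULL, vcpu_thread, NULL);
    pthread_create(&io, NULL, io_thread, NULL);
    pthread_join(io, NULL);       /* never returns: the process looks hung */
    return 0;
}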

Comment 4 Juan Quintela 2010-10-26 17:23:59 UTC
Reproduced; this should be fixed in the next released KVM.

Comment 9 Mike Cao 2010-11-24 11:00:40 UTC
(In reply to comment #1)
> Please note that the same operation succeeds on other hosts running the same
> version of kvm and kernel 2.6.18-194.
> 
> Also, the operation was performed from rhev-m --> vdsm --> kvm.


Hi, Haim

You mean that only a specific host can trigger it? Could you supply me with the host info so that I can reproduce it?

thanks 
Mike

Comment 10 Haim 2010-11-24 12:08:49 UTC
(In reply to comment #9)
> (In reply to comment #1)
> > Please note that the same operation succeeds on other hosts running the same
> > version of kvm and kernel 2.6.18-194.
> > 
> > Also, the operation was performed from rhev-m --> vdsm --> kvm.
> 
> 
> Hi, Haim
> 
> You mean that only a specific host can trigger it? Could you supply me with
> the host info so that I can reproduce it?
> 
> thanks 
> Mike

Mike, this bug was opened a long time ago, which means I no longer have the exact host and its information.
Please see Juan's comment - he managed to reproduce it; maybe you can ask him.

Other than that, the bug was surely fixed, as we run a lot of migration/suspend (regression) testing on rhel5.x, and no one in our group has come across it lately.
This is up to you.

Comment 14 errata-xmlrpc 2011-01-13 23:35:28 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0028.html

