Bug 523158

Summary: virsh restore causes libvirtd segfault in virExec()
Product: Fedora
Component: libvirt
Version: rawhide
Hardware: x86_64
OS: All
Status: CLOSED RAWHIDE
Severity: high
Priority: high
Reporter: IBM Bug Proxy <bugproxy>
Assignee: Daniel Veillard <veillard>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: berrange, clalance, crobinso, itamar, markmc, veillard, virt-maint
Fixed In Version: libvirt-0.7.1-13.fc12
Doc Type: Bug Fix
Bug Blocks: 498968
Last Closed: 2009-10-21 07:24:02 UTC

Attachments:
  /var/log/messages of the host
  debug log for virsh save
  debug log for virsh restore
  strace of virsh restore
  stack trace
  Save file of the guest

Description IBM Bug Proxy 2009-09-14 09:20:47 UTC
=Comment: #0=================================================
SANTWANA SAMANTRAY <santwana.samantray.com> - 

virsh restore of a KVM (RHEL 5.4) guest fails on Fedora 12 Alpha. Trying to restore the guest from
a saved file results in a segfault of the corresponding qemu-kvm process.

Command used:
virsh restore /home/saveguest

It waits for about 2 minutes, then ends in a segfault as below:
qemu-kvm[2323]: segfault at 1205cb0 ip 0000000001205cb0 sp 00007fff32dca858 error 15

Attachment: /var/log/messages of the host

=Comment: #4=================================================
ANOOP C. VIJAYAN <anoop.vijayan.com> - 

Image files obtained via virsh save. Note that the size is just 1.5 KB, so the save is failing even
though virsh reports that the domain was saved.

[root@mx3650b ~]# ls -lh /var/lib/libvirt/rhel5.4save*
-rw------- 1 root root 1.5K 2009-09-11 16:18 /var/lib/libvirt/rhel5.4save1
-rw------- 1 root root 1.5K 2009-09-11 12:52 /var/lib/libvirt/rhel5.4save2

Attaching the debug logs for libvirt
Attachment: debug log for virsh save

I'm able to save/restore successfully using the qemu-kvm/qemu monitor. The image obtained after a
save from the qemu monitor for the same guest is 146 MB in size.

[root@mx3650b ~]# ls -lh statefile.gz
-rw-r--r-- 1 root root 146M 2009-09-14 12:19 statefile.gz

Comment 1 IBM Bug Proxy 2009-09-14 09:20:59 UTC
Created attachment 360905 [details]
/var/log/messages of the host

Comment 2 IBM Bug Proxy 2009-09-14 09:21:07 UTC
Created attachment 360906 [details]
debug log for virsh save

Comment 3 IBM Bug Proxy 2009-09-14 09:41:40 UTC
Created attachment 360912 [details]
debug log for virsh restore


------- Comment on attachment From anoop.vijayan.com 2009-09-14 05:34 EDT-------


Once the restore command is run, the connection to the hypervisor is lost.

[root@mx3650b ~]# virsh list --all
^C <===== hangs here, need to kill

Attaching the debug log for 'virsh restore'

[root@mx3650b ~]# ps -eaf|grep qemu
root     10423 10031 99 12:42 pts/0    02:13:43 /usr/bin/qemu-kvm -S -M pc-0.11 -m 2000 -smp 4 -name rhel5_4_guest -uuid ed1f35f5-207c-5a31-6818-28c06782c1d5 -monitor unix:/var/run/libvirt/qemu/rhel5_4_guest.monitor,server,nowait -boot c -drive file=/var/lib/libvirt/images/rhel5_4_guest.img,if=ide,index=0,boot=on,format=raw -drive file=,if=ide,media=cdrom,index=2 -net nic,macaddr=52:54:00:49:23:9d,vlan=0,name=nic.0 -net tap,fd=19,vlan=0,name=tap.0 -serial pty -parallel none -usb -vnc 127.0.0.1:0 -k en-us -vga cirrus -soundhw es1370
qemu     15445     1 99 14:51 ?        00:07:44 /usr/bin/qemu-kvm -S -M pc-0.11 -m 2000 -smp 4 -name rhel5_4_guest -uuid ed1f35f5-207c-5a31-6818-28c06782c1d5 -monitor unix:/var/run/libvirt/qemu/rhel5_4_guest.monitor,server,nowait -boot c -drive file=/var/lib/libvirt/images/rhel5_4_guest.img,if=ide,index=0,boot=on,format=raw -drive file=,if=ide,media=cdrom,index=2 -net nic,macaddr=52:54:00:49:23:9d,vlan=0,name=nic.0 -net tap,fd=21,vlan=0,name=tap.0 -serial pty -parallel none -usb -vnc 127.0.0.1:1 -k en-us -vga cirrus -soundhw es1370 -incoming exec:cat

[root@mx3650b ~]# ps -eaf|grep libvirtd
root     10815     1  1 13:32 ?        00:01:45 libvirtd --daemon

[root@mx3650b ~]# /etc/init.d/libvirtd status
libvirtd (pid  10815) is running...
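
Note the second qemu-kvm process above carries "-incoming exec:cat": libvirt apparently restores a domain by spawning qemu with the save image wired to its stdin (consistent with infd=16 in the backtrace later in this bug), so that cat streams the image into qemu's incoming-migration channel. For illustration only, a minimal standalone sketch of that spawn pattern, reusing the save path from this report; this is not libvirt's actual virExec(), which handles far more:

#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* Save image path taken from the report. */
    int fd = open("/home/saveguest", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    pid_t pid = fork();
    if (pid == 0) {
        dup2(fd, STDIN_FILENO);  /* the save image becomes qemu's stdin */
        char *const argv[] = {
            (char *)"/usr/bin/qemu-kvm",
            (char *)"-incoming", (char *)"exec:cat",  /* cat relays stdin */
            NULL
        };
        execv(argv[0], argv);    /* argv[0] must be a valid, non-NULL path */
        _exit(127);
    }
    close(fd);
    waitpid(pid, NULL, 0);
    return 0;
}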

Comment 4 Mark McLoughlin 2009-09-17 11:49:32 UTC
Okay, it looks like we might have both a qemu and a libvirtd segfault here. Fun!

Which versions of libvirt and qemu are these?

Can you reproduce with the latest versions?

We need to try and get stack traces of both of these segfaults, if they are still happening, before we can figure out what the problem is.

Comment 5 IBM Bug Proxy 2009-09-18 10:31:47 UTC
Created attachment 361623 [details]
strace of virsh restore


------- Comment on attachment From santwana.samantray.com 2009-09-18 06:21 EDT-------


Hello Redhat,

I have created an attachment for the strace of virsh restore <guest>.
This is reproducible with the latest versions as well: the guest fails to restore from a saved file, and libvirtd segfaults.

Versions installed are:
libvirt-0.7.1-1.fc12.x86_64
qemu-0.10.92-4.fc12.x86_64

Thanks,
Santwana

Comment 6 Mark McLoughlin 2009-09-21 13:48:34 UTC
I'm afraid it's a stack trace of the segfaulting process we need, not an strace

https://fedoraproject.org/wiki/StackTraces

If you're having difficulty obtaining a stack trace, feel free to ask for more help

Comment 7 IBM Bug Proxy 2009-09-23 12:22:36 UTC
Created attachment 362290 [details]
stack trace


------- Comment on attachment From santwana.samantray.com 2009-09-23 08:14 EDT-------


Hello Redhat,

I have created an attachment for the stack trace (bt and bt full) of the segfaulting libvirtd process.

Below are the versions installed.
libvirt-client-0.7.1-4.fc12.x86_64
libvirt-0.7.1-4.fc12.x86_64
libvirt-python-0.7.1-4.fc12.x86_64
libvirt-devel-0.7.1-4.fc12.x86_64
libvirt-debuginfo-0.7.1-4.fc12.x86_64

Thanks,
Santwana

Comment 8 Mark McLoughlin 2009-09-23 15:56:51 UTC
santwana: I'm not sure this is a backtrace of a segfault; I think it may just be where the debugger stopped when it spawned the qemu process. If you continue on, can you get a stack trace of the actual SEGV? Thanks

Comment 9 IBM Bug Proxy 2009-10-07 11:02:04 UTC
------- Comment From anoop.vijayan.com 2009-10-07 06:50 EDT-------
This is where the segfault occurs.

Program terminated with signal 11, Segmentation fault
#0  0x0000003e6cca7d32 in execvpe () from /lib64/libc.so.6
(gdb) disassemble execvpe
Dump of assembler code for function execvpe:
0x0000003e6cca7d10 <execvpe+0>: push   %rbp
0x0000003e6cca7d11 <execvpe+1>: mov    %rsp,%rbp
0x0000003e6cca7d14 <execvpe+4>: mov    %rbx,-0x28(%rbp)
0x0000003e6cca7d18 <execvpe+8>: mov    %r12,-0x20(%rbp)
0x0000003e6cca7d1c <execvpe+12>:        mov    %rsi,%rbx
0x0000003e6cca7d1f <execvpe+15>:        mov    %r14,-0x10(%rbp)
0x0000003e6cca7d23 <execvpe+19>:        mov    %r13,-0x18(%rbp)
0x0000003e6cca7d27 <execvpe+23>:        mov    %rdi,%r14
0x0000003e6cca7d2a <execvpe+26>:        mov    %r15,-0x8(%rbp)
0x0000003e6cca7d2e <execvpe+30>:        add    $0xffffffffffffff80,%rsp
0x0000003e6cca7d32 <execvpe+34>:        cmpb   $0x0,(%rdi)   <==============
0x0000003e6cca7d35 <execvpe+37>:        mov    %rdx,%r12
0x0000003e6cca7d38 <execvpe+40>:        je     0x3e6cca7d88 <execvpe+120>
0x0000003e6cca7d3a <execvpe+42>:        mov    $0x2f,%esi
0x0000003e6cca7d3f <execvpe+47>:        callq  0x3e6cc7f6f0 <strchr>
0x0000003e6cca7d44 <execvpe+52>:        test   %rax,%rax
0x0000003e6cca7d47 <execvpe+55>:        je     0x3e6cca7e58 <execvpe+328>

(gdb) bt
#0  0x0000003e6cca7d32 in execvpe () from /lib64/libc.so.6
#1  0x0000003e70026286 in __virExec (conn=0x0, argv=<value optimized out>, envp=<value optimized out>, keepfd=<value optimized out>,
retpid=<value optimized out>, infd=16, outfd=<value optimized out>, errfd=<value optimized out>, flags=<value optimized out>,
hook=<value optimized out>, data=<value optimized out>, pidfile=<value optimized out>) at util.c:563
#2  0x0000003e70026c33 in virExecWithHook (conn=0x7f149000c750, argv=0x7f14a5fada30, envp=0x0, keepfd=0x0, retpid=0x7f14a5fada58, infd=16,
outfd=<value optimized out>, errfd=<value optimized out>, flags=<value optimized out>, hook=<value optimized out>,
data=<value optimized out>, pidfile=<value optimized out>) at util.c:612
#3  0x0000003e70026def in virExec (conn=<value optimized out>, argv=<value optimized out>, envp=<value optimized out>,
keepfd=<value optimized out>, retpid=<value optimized out>, infd=<value optimized out>, outfd=<value optimized out>,
errfd=<value optimized out>, flags=<value optimized out>) at util.c:632
#4  0x0000000000435680 in qemudDomainRestore (conn=0x7f149000c750, path=<value optimized out>) at qemu_driver.c:4541
#5  0x0000003e700389bf in virDomainRestore (conn=0x7f149000c750, from=0x7f1494003b70 "/home/newsave") at libvirt.c:2234
#6  0x000000000041c00f in remoteDispatchDomainRestore (server=<value optimized out>, client=<value optimized out>, conn=0x7f149000c750,
rerr=0x7f14a5faebd0, args=<value optimized out>, ret=<value optimized out>) at remote.c:1635
#7  0x000000000041f8f6 in remoteDispatchClientCall (server=<value optimized out>, client=0x21b11e0, msg=0x227c030) at dispatch.c:459
#8  0x000000000041fbb4 in remoteDispatchClientRequest (server=<value optimized out>, client=0x21b11e0, msg=0x227c030) at dispatch.c:359
#9  0x000000000041396c in qemudWorker (data=0x2219120) at qemud.c:1464
#10 0x0000003e6d40696a in start_thread () from /lib64/libpthread.so.0
#11 0x0000003e6cce12cd in clone () from /lib64/libc.so.6
#12 0x0000000000000000 in ?? ()

(gdb) info registers
rax            0x3e6cf7cf60     268116152160
rbx            0x7f14a5fada30   139726660753968
rcx            0xfffffffffffffed0       -304
rdx            0x2179f00        35102464
rsi            0x7f14a5fada30   139726660753968
rdi            0x0      0                                       <=====
rbp            0x7f14a5fad6d0   0x7f14a5fad6d0
rsp            0x7f14a5fad650   0x7f14a5fad650
r8             0x0      0
r9             0x40     64
r10            0x8      8
r11            0x0      0
r12            0x400    1024
r13            0x13     19
r14            0x0      0
r15            0x10     16
rip            0x3e6cca7d32     0x3e6cca7d32 <execvpe+34>
eflags         0x10207  [ CF PF IF RF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0
fctrl          0x37f    895
fstat          0x0      0
ftag           0xffff   65535
fiseg          0x0      0
fioff          0x0      0
foseg          0x0      0
fooff          0x0      0
fop            0x0      0
mxcsr          0x1f80   [ IM DM ZM OM UM PM ]

[root@mx3650b ~]# rpm -q glibc
glibc-2.10.90-22.x86_64

Comment 10 Mark McLoughlin 2009-10-09 11:20:45 UTC
anoop: thanks for the backtrace

danpb, DV, clalance: maybe one of you guys could look at this?

Comment 11 Chris Lalancette 2009-10-09 11:39:45 UTC
Actually, it's quite odd.  From the disassembly, it looks like it was trying to dereference %rdi.  In the AMD64 calling convention, %rdi holds the first passed parameter, which in this case would be argv[0].  So that suggests that argv[0] is NULL, and then the dereference of that causes the crash.
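
For illustration only (this snippet is not from the report): a NULL first argument to execvpe() reproduces exactly the fault signature captured above, where glibc's first real instruction, cmpb $0x0,(%rdi), dereferences %rdi == 0:

#define _GNU_SOURCE          /* execvpe() is a GNU extension */
#include <unistd.h>

extern char **environ;

int main(void)
{
    char *argv[] = { NULL };  /* empty argument vector, as suspected here */
    /* Per the disassembly above, glibc tests *file for an empty string
     * without a NULL check first, so file == NULL faults immediately at
     * cmpb $0x0,(%rdi) with rdi = 0. */
    execvpe(argv[0], argv, environ);
    return 1;
}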

One way I could imagine this happening is because the save file is actually corrupt.  That is, the original issue mentioned that the save file is only 1.5KB in size, meaning that the save didn't work properly.  So that means we have two bugs here:

1)  Why the save didn't complete properly.
2)  Why a restore from this corrupted save file doesn't simply fail, instead of crashing libvirtd.

SANTWANA: can you attach one of those 1.5KB save files to this BZ?  I'm curious what is in them that might be causing this to occur.

Thanks,
Chris Lalancette

Comment 12 Daniel Berrangé 2009-10-09 11:56:44 UTC
I think collecting a trace of libvirtd under valgrind could prove much more useful than this stack trace. The actual corruption may well have occurred earlier, and valgrind might do a better job of showing where.

Comment 13 IBM Bug Proxy 2009-10-09 13:22:21 UTC
Created attachment 364244 [details]
Save file of the guest


------- Comment on attachment From santwana.samantray.com 2009-10-09 09:17 EDT-------


Hello Chris,

I have created an attachment for the guest's save file. Trying to restore from this file results in a segfault of libvirtd.

Thanks,
Santwana

Comment 14 Mark McLoughlin 2009-10-13 09:13:29 UTC
*** Bug 528558 has been marked as a duplicate of this bug. ***

Comment 15 Mark McLoughlin 2009-10-13 09:14:54 UTC
Dan has posted two patches fixing issues with restore:

  http://www.redhat.com/archives/libvir-list/2009-October/msg00337.html
  http://www.redhat.com/archives/libvir-list/2009-October/msg00338.html

The first one should fix this issue
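
For context, the crash-avoidance half of such a fix plausibly comes down to validating the save image's header before any qemu command line is built. A hedged sketch under the assumption of a fixed magic string plus a versioned header; the struct and constant names below are illustrative, not necessarily libvirt's actual definitions in qemu_driver.c:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SAVE_MAGIC "LibvirtQemudSave"   /* assumed magic; illustrative */

struct save_header {
    char     magic[sizeof(SAVE_MAGIC) - 1];
    uint32_t version;
    uint32_t xml_len;
};

/* Reject anything without a plausible header: a truncated 1.5 KB image
 * fails here with a clean error instead of ever reaching virExec(). */
static int check_save_header(const char *path)
{
    struct save_header hdr;
    FILE *fp = fopen(path, "r");

    if (!fp)
        return -1;
    if (fread(&hdr, sizeof(hdr), 1, fp) != 1 ||
        memcmp(hdr.magic, SAVE_MAGIC, sizeof(hdr.magic)) != 0 ||
        hdr.xml_len == 0) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    return 0;
}

int main(int argc, char **argv)
{
    return (argc > 1 && check_save_header(argv[1]) == 0) ? 0 : 1;
}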

Comment 16 Mark McLoughlin 2009-10-19 11:18:46 UTC
tag request: https://fedorahosted.org/rel-eng/ticket/2528

Comment 17 IBM Bug Proxy 2009-10-27 06:03:05 UTC
------- Comment From santwana.samantray.com 2009-10-27 01:51 EDT-------
Hello,

I verified this issue on the latest F12 rawhide (kernel 2.6.31.5-96.fc12.x86_64), and the issue still exists. While trying to save the guest using "virsh save rhel5.4 /home/guest", it hangs.

The guest is not getting saved to a file. The libvirt version installed is libvirt-0.7.1-13.fc12.x86_64.

Santwana

Comment 18 Mark McLoughlin 2009-10-29 12:20:13 UTC
Santwana - please file a new bug; a 'virsh save' hang is a different bug from a libvirtd segfault during 'virsh restore'

Comment 19 IBM Bug Proxy 2009-11-05 11:21:23 UTC
------- Comment From santwana.samantray.com 2009-11-05 06:14 EDT-------
Hi,

I have opened a new bug, "Bug 57446 - RH532654 - virsh save hangs in Fedora12 rawhide", for the "virsh save" hang issue.

Thanks,
Santwana

Comment 20 IBM Bug Proxy 2009-11-06 12:01:27 UTC
------- Comment From santwana.samantray.com 2009-11-06 06:54 EDT-------
Hello Anoop,

After applying the patch from Bug 57446 (RH532654 - virsh save hangs in Fedora12 rawhide), virsh restore also works fine now. We can close this issue.

Thanks,
Santwana