Bug 512380

Summary: KVM CentOS VM shut down spontaneously when using a SCSI disk, works with virtio/IDE
Product: [Fedora] Fedora
Reporter: Joshua Rosen <bjrosen>
Component: qemu
Assignee: Glauber Costa <gcosta>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: low
Version: 11
CC: berrange, clalance, dwmw2, ehabkost, gcosta, itamar, jaswinder, markmc, virt-maint
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-11-20 14:38:35 EST

Description Joshua Rosen 2009-07-17 11:35:47 EDT
Description of problem: I have a 64-bit CentOS 5.3 KVM VM hosted on a 64-bit Fedora 11 machine. The host machine is a Core2 with 8G of RAM. The VM is configured to use both cores and 6G of RAM. Both the F11 host and the CentOS 5.3 guest are running at init 3 (no X). I was running a heavy load on the VM (Verilog simulations on both cores), and the VM shut down around 15 hours after the start of the simulations. It was in the middle of a very large grep (hundreds of files) when the shutdown occurred. The files were located on a virtual disk, not on an NFS mount. (NFS mounts are my usual practice; I was trying to see how much the performance improved with a virtual disk. For your information, the simulation time went from 18 hours with NFS to 15 hours with the virtual disk, which is very close to native performance.)

This system was previously running F10 and VMware Server 2; I never experienced this with VMware.

I looked in /var/log/messages for both the host and the VM, nothing obvious jumped out at me. Is there some other log file I should examine?


How reproducible: Don't know, I'll kick off a new simulation tonight to see if it happens again.


Comment 1 Glauber Costa 2009-07-17 11:46:29 EDT
Tell us more about this spontaneous shutdown.

There are now two bugs like this:

https://bugzilla.redhat.com/show_bug.cgi?id=511955
https://bugzilla.redhat.com/show_bug.cgi?id=512289

The first seems to be a SIGSEGV, while the second is a qemu termination with exit code 1 that happens when using the network.

So a couple of things:
1) Are you getting a SIGSEGV?
2) Is your simulation using the network? If it is, can you describe in a little more detail what it is doing?

Thanks!
Comment 2 Joshua Rosen 2009-07-17 12:02:02 EDT
I'm able to reproduce the problem by doing the grep. I got the following error in /var/log/messages:

Jul 17 11:45:16 localhost kernel: qemu-kvm[28098]: segfault at 22103030 ip 00000000004c292f sp 00007fff05de8d50 error 4 in qemu-kvm[400000+1da000]
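
As a reading aid for the kernel line above, here is a minimal Python sketch that splits it into fields. It assumes the usual x86 page-fault error-code bits (bit 0: page was present, bit 1: write access, bit 2: user mode) and that the error field is printed in hex. The function name and dictionary keys are illustrative only, not part of any existing tool.

```python
import re

# Pattern for the kernel's "segfault at ..." log line (x86 format assumed).
SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

def decode_segfault(line):
    """Return the decoded fields of a segfault log line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    err = int(m.group("err"), 16)
    return {
        "pid": int(m.group("pid")),
        "fault_addr": int(m.group("addr"), 16),
        "ip": ip,
        # Offset of the faulting instruction inside the mapped object.
        "offset_in_object": ip - base,
        "page_present": bool(err & 1),   # bit 0: 0 means page not present
        "write_access": bool(err & 2),   # bit 1: 0 means read
        "user_mode": bool(err & 4),      # bit 2: 1 means user mode
    }
```

Under those assumptions, "error 4" in the line above reads as a user-mode read of an unmapped address (0x22103030), with the faulting instruction at offset 0xc292f into the qemu-kvm mapping.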
Comment 3 Joshua Rosen 2009-07-17 12:19:13 EDT
(gdb) cont
Continuing.
[New Thread 0x7f7b4abfd910 (LWP 28654)]

[Thread 0x7f7b4abfd910 (LWP 28654) exited]
[New Thread 0x7f7b4abfd910 (LWP 28658)]
[New Thread 0x7f7b4bfff910 (LWP 28659)]
[New Thread 0x7f7b4b5fe910 (LWP 28660)]

Program received signal SIGSEGV, Segmentation fault.
0x00000000004c292f in pthread_attr_setdetachstate ()
(gdb) Continuing.
[Thread 0x7f7b4b5fe910 (LWP 28660) exited]
[Thread 0x7f7b4bfff910 (LWP 28659) exited]
[Thread 0x7f7b4abfd910 (LWP 28658) exited]
[Thread 0x7f7b50c6f910 (LWP 28599) exited]
[Thread 0x7f7b51674910 (LWP 28578) exited]

Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.
(gdb) bt
No stack.
(gdb)
Comment 4 Glauber Costa 2009-07-17 14:27:31 EDT
So it is probably the same thing as 511955.

Can you post the contents of your VM log in /var/log/libvirt/qemu?
Comment 5 Joshua Rosen 2009-07-17 14:50:58 EDT
I cleaned out the log files, rebooted the VM, and then caused it to crash again. Here is the log file:

LC_ALL=C LD_LIBRARY_PATH= PATH=/sbin:/usr/sbin:/bin:/usr/bin HOME=/root /usr/bin/qemu-kvm -S -M pc -m 6144 -smp 2 -name avenger -uuid 25d8b1d3-bda9-5784-fe23-56df1d57043c -monitor pty -pidfile /var/run/libvirt/qemu//avenger.pid -boot c -drive file=/home/kvm/avenger.img,if=ide,index=0,boot=on -drive file=/home/kvm/local.img,if=scsi,index=0 -net nic,macaddr=00:02:b3:e7:55:b4,vlan=0 -net tap,fd=15,vlan=0 -net nic,macaddr=54:52:00:7f:da:f8,vlan=1 -net tap,fd=18,vlan=1 -serial pty -parallel none -usb -vnc 127.0.0.1:0 -soundhw es1370 
char device redirected to /dev/pts/0
char device redirected to /dev/pts/1
ALSA lib pulse.c:272:(pulse_connect) PulseAudio: Unable to connect: Connection refused

sdl: SDL_OpenAudio failed
sdl: Reason: No available audio device
sdl: SDL_OpenAudio failed
sdl: Reason: No available audio device
audio: Failed to create voice `es1370.dac2'
audio: Failed to create voice `es1370.adc'
Comment 6 Glauber Costa 2009-07-17 20:34:59 EDT
Those logs unfortunately fail to provide any information that could give us a clue about it.

As a wild guess, can you change your "local.img" to be an IDE disk, or remove it entirely? The SCSI code in qemu is unfortunately not that good, so this is worth trying.
Comment 7 Joshua Rosen 2009-07-17 21:18:32 EDT
Switching it to IDE fixed the problem; I was able to run the grep without a crash. In SCSI mode it crashes almost immediately.
Comment 8 Glauber Costa 2009-07-20 14:15:47 EDT
Please test this:

http://koji.fedoraproject.org/koji/taskinfo?taskID=1487106

It won't solve your problem, but will give you a build with scsi debug messages enabled.

You'll see them in /var/log/libvirt/qemu/<machine>.log

Please post that log here.
Comment 9 Joshua Rosen 2009-07-20 15:52:27 EDT
I'm using this system for production work, so I can't load a bunch of development packages on it.

You should be able to replicate this problem. The way to do it is to create a text file that's about 50K in size. You should then generate a directory tree that's four levels deep and has about 2K directories at the lowest level. Then copy the text file into each of the directories. Then do a grep on all copies of the text file.
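
The steps above could be scripted roughly as follows. This is a sketch under assumptions, not the exact layout used on the affected VM: the fan-out of 7 per level (7^4 = 2401 leaf directories), the directory and file names, and the filler text are all made up for illustration.

```python
import itertools
import os

def build_tree(root, depth=4, fanout=7, file_size=50 * 1024, name="sample.txt"):
    """Create fanout**depth leaf directories under root, each holding one
    text file of file_size bytes. Returns the number of files written."""
    line = b"the quick brown fox jumps over the lazy dog\n"
    # Repeat the filler line and trim to exactly file_size bytes.
    payload = (line * (file_size // len(line) + 1))[:file_size]
    leaves = 0
    for parts in itertools.product(range(fanout), repeat=depth):
        leaf = os.path.join(root, *("d%d" % p for p in parts))
        os.makedirs(leaf, exist_ok=True)
        with open(os.path.join(leaf, name), "wb") as f:
            f.write(payload)
        leaves += 1
    return leaves
```

Calling build_tree() inside the guest with root set to a directory on the SCSI-attached disk writes about 120 MB with the defaults; a recursive grep for a word from the filler text (for example "fox") over that directory then reads every copy.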
Comment 10 Glauber Costa 2009-07-20 18:10:41 EDT
I plugged in a SCSI disk containing a complete distribution and ran a grep over the whole disk.

The problem did not appear. I'll try the method you described, but the best advice I have so far is to use IDE or virtio.
Comment 11 Joshua Rosen 2009-07-20 18:34:54 EDT
IDE definitely works, I haven't tried virtio. Is there a performance difference between virtio and IDE?
Comment 12 Mark McLoughlin 2009-08-07 10:57:12 EDT
Joshua: if you can, please try to reproduce this using Glauber's debug package on a test system. Since he can't reproduce it, there's not much we can do without a debug log.

Note: qemu's SCSI emulation is probably not as reliable as its IDE emulation, so for production you're definitely better off with IDE
Comment 13 Joshua Rosen 2009-08-07 13:14:09 EDT
I'm putting together a new Core i7 box, which I'll be putting through my commissioning process over the next few days. Once I have it running solidly, I'll see if I can reproduce the problem on it.
Comment 14 Mark McLoughlin 2009-08-10 08:57:55 EDT
Thanks Joshua
Comment 15 Mark McLoughlin 2009-11-20 14:38:35 EST
No response to needinfo since 2009-08-10, closing