Bug 1022561

Summary: nova: killing qemu pid which remains after delete of instances will cause openstack-nova-compute to die
Product: Red Hat OpenStack
Component: openstack-nova
Version: 4.0
Target Release: 4.0
Hardware: x86_64
OS: Linux
Whiteboard: storage
Status: CLOSED DUPLICATE
Severity: medium
Priority: unspecified
Target Milestone: ---
Keywords: Reopened, Unconfirmed, ZStream
Reporter: Dafna Ron <dron>
Assignee: Solly Ross <sross>
QA Contact: Ami Jeain <ajeain>
CC: dallan, dron, hateya, ndipanov, sclewis, sross, xqueralt, yeylon
Doc Type: Bug Fix
Type: Bug
Regression: ---
Last Closed: 2014-04-07 15:02:53 UTC

Attachments: log (no flags)

Description Dafna Ron 2013-10-23 14:16:18 UTC
Created attachment 815437 [details]
log

Description of problem:

I am working with Gluster as the Cinder backend, and when I boot an instance I can see two different qemu PIDs running:

[root@cougar07 ~(keystone_admin)]# ps -elf |grep qemu
2 S nova     13775 13645  6  80   0 - 212831 poll_s 16:51 ?       00:00:02 /usr/libexec/qemu-kvm -global virtio-blk-pci.scsi=off -nodefconfig -nodefaults -nographic -machine accel=kvm:tcg -cpu host,+kvmclock -m 500 -no-reboot -kernel /var/tmp/.guestfs-162/kernel.13645 -initrd /var/tmp/.guestfs-162/initrd.13645 -device virtio-scsi-pci,id=scsi -drive file=/var/lib/nova/instances/2f469c9d-2bf8-44ba-ac9a-e280289802bf/disk,cache=none,format=qcow2,id=hd0,if=none -device scsi-hd,drive=hd0 -drive file=/var/tmp/.guestfs-162/root.13645,snapshot=on,id=appliance,if=none,cache=unsafe -device scsi-hd,drive=appliance -device virtio-serial -serial stdio -device sga -chardev socket,path=/tmp/libguestfsWTO2Jc/guestfsd.sock,id=channel0 -device virtserialport,chardev=channel0,name=org.libguestfs.channel.0 -append panic=1 console=ttyS0 udevtimeout=600 no_timer_check acpi=off printk.time=1 cgroup_disable=memory root=/dev/sdb selinux=0 TERM=xterm
6 S qemu     14018     1 72  80   0 - 217865 poll_s 16:51 ?       00:00:19 /usr/libexec/qemu-kvm -name instance-00000035 -S -M rhel6.5.0 -cpu Opteron_G3,+nodeid_msr,+wdt,+skinit,+ibs,+osvw,+3dnowprefetch,+cr8legacy,+extapic,+cmp_legacy,+3dnow,+3dnowext,+pdpe1gb,+fxsr_opt,+mmxext,+ht,+vme -enable-kvm -m 512 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 2f469c9d-2bf8-44ba-ac9a-e280289802bf -smbios type=1,manufacturer=Red Hat Inc.,product=OpenStack Nova,version=2013.2-0.25.rc1.el6ost,serial=44454c4c-4a00-1044-804c-b5c04f39354a,uuid=2f469c9d-2bf8-44ba-ac9a-e280289802bf -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-00000035.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -no-kvm-pit-reinjection -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/nova/instances/2f469c9d-2bf8-44ba-ac9a-e280289802bf/disk,if=none,id=drive-virtio-disk0,format=qcow2,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=22,id=hostnet0,vhost=on,vhostfd=23 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:37:43:23,bus=pci.0,addr=0x3 -chardev file,id=charserial0,path=/var/lib/nova/instances/2f469c9d-2bf8-44ba-ac9a-e280289802bf/console.log -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0 -vnc 10.35.160.135:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5

When I delete the instance, only one of these PIDs is left:

[root@cougar07 ~(keystone_admin)]# ps -elf |grep qemu
2 S nova     13775 13645  4  80   0 - 212831 poll_s 16:51 ?       00:00:02 /usr/libexec/qemu-kvm -global virtio-blk-pci.scsi=off -nodefconfig -nodefaults -nographic -machine accel=kvm:tcg -cpu host,+kvmclock -m 500 -no-reboot -kernel /var/tmp/.guestfs-162/kernel.13645 -initrd /var/tmp/.guestfs-162/initrd.13645 -device virtio-scsi-pci,id=scsi -drive file=/var/lib/nova/instances/2f469c9d-2bf8-44ba-ac9a-e280289802bf/disk,cache=none,format=qcow2,id=hd0,if=none -device scsi-hd,drive=hd0 -drive file=/var/tmp/.guestfs-162/root.13645,snapshot=on,id=appliance,if=none,cache=unsafe -device scsi-hd,drive=appliance -device virtio-serial -serial stdio -device sga -chardev socket,path=/tmp/libguestfsWTO2Jc/guestfsd.sock,id=channel0 -device virtserialport,chardev=channel0,name=org.libguestfs.channel.0 -append panic=1 console=ttyS0 udevtimeout=600 no_timer_check acpi=off printk.time=1 cgroup_disable=memory root=/dev/sdb selinux=0 TERM=xterm
0 S root     14135 13613  0  80   0 - 25813 pipe_w 16:52 pts/2    00:00:00 grep qemu
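
For reference, the leftover process and its parent can be inspected directly. This is a minimal diagnostic sketch using the PIDs from the transcript above; treating 13645 as the nova-compute service process (and the pid-file path) is an assumption based on the parent/child relationship, not something stated explicitly in this report:

ps -o pid,ppid,user,stat,wchan:20,args -p 13775   # the leftover qemu (libguestfs-style appliance)
ps -o pid,args -p 13645                           # its parent process
cat /var/run/nova/nova-compute.pid                # assumed pid-file path; compare with the parent PID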

If I kill it, nova-compute dies as well:

[root@cougar07 ~(keystone_admin)]# kill -9 13645
[root@cougar07 ~(keystone_admin)]# 
[root@cougar07 ~(keystone_admin)]# 
[root@cougar07 ~(keystone_admin)]# ps -elf |grep qemu
0 S root     14140 13613  0  80   0 - 25813 pipe_w 16:52 pts/2    00:00:00 grep qemu
[root@cougar07 ~(keystone_admin)]# 
[root@cougar07 ~(keystone_admin)]# /etc/init.d/openstack-nova-compute status
openstack-nova-compute dead but pid file exists
[root@cougar07 ~(keystone_admin)]# 

Version-Release number of selected component (if applicable):

openstack-nova-compute-2013.2-0.25.rc1.el6ost.noarch

How reproducible:

100%

Steps to Reproduce:
1. configure Gluster as the Cinder backend
2. boot an instance and check the qemu processes with ps
3. delete the instance
4. kill the leftover PID (a command-level sketch follows)
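
A minimal command-level sketch of the steps above; the image/flavor names, the placeholder PID and the cinder.conf lines are illustrative assumptions, not values taken from this report:

# cinder.conf is assumed to already point at a GlusterFS backend, e.g.:
#   volume_driver = cinder.volume.drivers.glusterfs.GlusterfsDriver
#   glusterfs_shares_config = /etc/cinder/shares.conf

nova boot --image <image> --flavor <flavor> test-instance   # step 2: boot an instance
ps -elf | grep [q]emu                                        # two qemu processes are expected

nova delete test-instance                                    # step 3: delete it
ps -elf | grep [q]emu                                        # one qemu process is expected to linger

kill -9 <leftover-pid>                                       # step 4: kill the leftover PID
/etc/init.d/openstack-nova-compute status                    # check whether nova-compute survived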

Actual results:

nova-compute dies as well

Expected results:

nova-compute should not die

Additional info:

Comment 1 Solly Ross 2013-10-31 15:44:30 UTC
Can you get me logs from the other compute services (namely the scheduler)?  Additionally, what is your setup -- which nodes, etc.? I want to make sure I reproduce it correctly.  Also, does this only happen when running Cinder with a Gluster backend?

Comment 2 Dafna Ron 2013-10-31 15:56:26 UTC
I will share the setup details privately (the logs are there).
I am not sure whether this is Gluster-related, since I only have a Gluster setup at this time.

Comment 3 Solly Ross 2013-10-31 19:43:18 UTC
Notes: the processes are left in STAT=S (interruptible sleep) on account of poll_schedule_timeout, i.e. they are waiting on a poll() call.
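
For context, the state and wait channel can be checked directly with ps; a minimal sketch using standard procps options, assumed available on the affected hosts:

ps -C qemu-kvm -o pid,ppid,user,stat,wchan:32,args   # expect STAT=S and WCHAN=poll_schedule_timeout for the leftover processes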

Comment 4 Solly Ross 2013-11-01 16:27:29 UTC
This makes me suspect that the cause is the Gluster driver.  I will do some more digging.

Comment 6 Solly Ross 2013-12-19 20:15:56 UTC
I could not reproduce this anymore.

Comment 7 Dafna Ron 2014-02-26 14:07:38 UTC
I'm re-opening this since it is 100% reproducible on my setup (latest 4.0 puddle release).
If you would like to investigate further and cannot reproduce it on your setup, please contact me and I will give you access to mine.

[root@puma31 ~]# ps -elf |grep qemu
0 S root      9993  9708  0  80   0 - 25813 pipe_w 16:05 pts/0    00:00:00 grep qemu
2 Z nova     16552  8454  0  80   0 -     0 exit   Feb21 ?        00:03:03 [qemu-kvm] <defunct>
2 S nova     18210  8454  0  80   0 - 212852 poll_s Feb21 ?       00:02:54 /usr/libexec/qemu-kvm -global virtio-blk-pci.scsi=off -nodefconfig -nodefaults -nographic -machine accel=kvm:tcg -cpu host,+kvmclock -m 500 -no-reboot -kernel /var/tmp/.guestfs-162/kernel.8454 -initrd /var/tmp/.guestfs-162/initrd.8454 -device virtio-scsi-pci,id=scsi -drive file=/var/lib/nova/instances/c1d544cb-1d1d-435f-9672-af5347d29c0e/disk,cache=none,format=qcow2,id=hd0,if=none -device scsi-hd,drive=hd0 -drive file=/var/tmp/.guestfs-162/root.8454,snapshot=on,id=appliance,if=none,cache=unsafe -device scsi-hd,drive=appliance -device virtio-serial -serial stdio -device sga -chardev socket,path=/tmp/libguestfsRV7oiz/guestfsd.sock,id=channel0 -device virtserialport,chardev=channel0,name=org.libguestfs.channel.0 -append panic=1 console=ttyS0 udevtimeout=600 no_timer_check acpi=off printk.time=1 cgroup_disable=memory root=/dev/sdb selinux=0 TERM=xterm
[root@puma31 ~]# ps -elf |grep qemu
0 S root      9996  9708  0  80   0 - 25813 pipe_w 16:05 pts/0    00:00:00 grep qemu
2 Z nova     16552  8454  0  80   0 -     0 exit   Feb21 ?        00:03:03 [qemu-kvm] <defunct>
2 S nova     18210  8454  0  80   0 - 212852 poll_s Feb21 ?       00:02:54 /usr/libexec/qemu-kvm -global virtio-blk-pci.scsi=off -nodefconfig -nodefaults -nographic -machine accel=kvm:tcg -cpu host,+kvmclock -m 500 -no-reboot -kernel /var/tmp/.guestfs-162/kernel.8454 -initrd /var/tmp/.guestfs-162/initrd.8454 -device virtio-scsi-pci,id=scsi -drive file=/var/lib/nova/instances/c1d544cb-1d1d-435f-9672-af5347d29c0e/disk,cache=none,format=qcow2,id=hd0,if=none -device scsi-hd,drive=hd0 -drive file=/var/tmp/.guestfs-162/root.8454,snapshot=on,id=appliance,if=none,cache=unsafe -device scsi-hd,drive=appliance -device virtio-serial -serial stdio -device sga -chardev socket,path=/tmp/libguestfsRV7oiz/guestfsd.sock,id=channel0 -device virtserialport,chardev=channel0,name=org.libguestfs.channel.0 -append panic=1 console=ttyS0 udevtimeout=600 no_timer_check acpi=off printk.time=1 cgroup_disable=memory root=/dev/sdb selinux=0 TERM=xterm
[root@puma31 ~]# 
[root@puma31 ~]# 
[root@puma31 ~]# 
[root@puma31 ~]# 
[root@puma31 ~]# kill -9 8454
[root@puma31 ~]# /etc/init.d/openstack-nova-compute status
openstack-nova-compute dead but pid file exists
[root@puma31 ~]#

Comment 11 Solly Ross 2014-02-27 21:54:51 UTC
@Dafna: is this related to https://bugzilla.redhat.com/show_bug.cgi?id=1022627?  I believe the two are related.  Did you try the fix proposed upstream for that bug?  If not, please try it.

Comment 13 Dave Allan 2014-03-06 16:35:06 UTC
Solly, can you provide Dafna a scratch build with the proposed fix, or work with her to patch her systems so she can test?

Comment 14 Solly Ross 2014-03-18 17:58:23 UTC
I've tested the patch on the QE systems, and it appears to work.  The backport is being checked upstream, so it should be included the next time someone rebases RHOS 4.0.z off of stable/havana upstream.

Comment 17 Xavier Queralt 2014-04-07 09:44:24 UTC
I think this could be closed as a duplicate of bug 1022627.

What do you think, Solly?

Comment 18 Solly Ross 2014-04-07 15:02:53 UTC
@Xavier Queralt: yeah, sounds good.

*** This bug has been marked as a duplicate of bug 1022627 ***