Description of problem:
I have a host with NFS storage, and I am blocking the connection to the storage. After a while, libvirtd gets stuck in D state:

root 10367 0.0 0.1 1044980 17872 ? DLl Feb06 0:54 libvirtd --daemon --listen

Logs:
2012-02-08 08:27:47.748+0000: 10367: debug : qemuProcessKill:3227 : vm=rhel_ha pid=26459 gracefully=0
2012-02-08 08:27:47.748+0000: 10367: debug : qemuProcessAutoDestroyRemove:3749 : vm=rhel_ha uuid=1b3db5d8-5baf-46c7-b993-db3f7fc98482
2012-02-08 08:53:48.449+0000: 10367: warning : SELinuxRestoreSecurityFileLabel:519 : cannot resolve symlink /rhev/data-center/cd84d709-d762-4df6-9667-a7d0981bd8ed/3041dbba-225f-400f-ad32-09314284553e/images/9d934674-d74f-476e-8bdc-5131c0763dc0/9f8070a4-0cb5-457d-9e9c-390317a04f40: Input/output error
2012-02-08 08:53:48.450+0000: 10367: debug : virCgroupNew:602 : New group /libvirt/qemu/rhel_ha

Version-Release number of selected component (if applicable):
libvirt-0.9.6-4.fc16.x86_64
vdsm-4.9.3.2-0.fc16.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Run VMs on the host and block the connection to the storage.

Actual results:
The libvirtd process gets stuck in D state when the connection to the storage is blocked.

Expected results:
libvirt should not get stuck.

Additional info:
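One way to block the connection in step 1 is to drop traffic to the NFS server with iptables; the address below is a hypothetical placeholder for the server's IP, and the commands must run as root:

```shell
# Drop all traffic to the NFS server to simulate a storage outage
# (192.0.2.10 is a placeholder address -- substitute your NFS server's IP):
iptables -A OUTPUT -d 192.0.2.10 -j DROP

# Later, restore connectivity by deleting the same rule:
iptables -D OUTPUT -d 192.0.2.10 -j DROP
```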
Created attachment 560196 [details] logs
Tested the issue on downstream with the same scenario. libvirtd stops logging, but the libvirt process stays in S state:

root 2497 0.0 0.0 925324 15516 ? SLsl Feb06 0:02 /usr/sbin/libvirtd --listen
qemu 25491 1.4 1.7 1050600 286912 ? Sl 14:06 0:25 /usr/libexec/qemu-kvm -S -M rhel6.2.0 -cpu Conroe -enable-kvm -m 500 -smp 1,sockets=1,cores=1,threads=1 -name pin_to_host -uuid da6438b3-0ee4-457b-9ca1-07b7e54b458a -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=6Server-6.2.0.3.el6,serial=0BACEBFC-EE0D-11DF-89EA-E41F13CC3360_00:10:18:53:D5:94,uuid=da6438b3-0ee4-457b-9ca1-07b7e54b458a -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/pin_to_host.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2012-02-09T12:06:51,driftfix=slew -no-shutdown -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x4 -drive file=/rhev/data-center/7e1e7286-d526-4a5d-9068-bec94ea32665/5b99aaf3-cb8e-421e-8d28-c028c082d913/images/e5303dfd-d4c5-423a-9861-be93589e349e/7a223e92-8ffa-4a2a-ae88-69db5192c811,if=none,id=drive-ide0-0-0,format=qcow2,serial=3a-9861-be93589e349e,cache=none,werror=stop,rerror=stop,aio=threads -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=27 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:23:71:07,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/pin_to_host.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev spicevmc,id=charchannel1,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=com.redhat.spice.0 -usb -spice port=5900,tls-port=5901,addr=0,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=inputs -k en-us -vga qxl -global qxl-vga.vram_size=67108864
qemu 25622 1.5 1.7 1050600 289972 ? Sl 14:06 0:27 /usr/libexec/qemu-kvm -S -M rhel6.2.0 -cpu Conroe -enable-kvm -m 500 -smp 1,sockets=1,cores=1,threads=1 -name rhel_ha -uuid 1b3db5d8-5baf-46c7-b993-db3f7fc98482 -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=6Server-6.2.0.3.el6,serial=0BACEBFC-EE0D-11DF-89EA-E41F13CC3360_00:10:18:53:D5:94,uuid=1b3db5d8-5baf-46c7-b993-db3f7fc98482 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/rhel_ha.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2012-02-09T12:06:52,driftfix=slew -no-shutdown -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x4 -drive file=/rhev/data-center/7e1e7286-d526-4a5d-9068-bec94ea32665/5b99aaf3-cb8e-421e-8d28-c028c082d913/images/9d934674-d74f-476e-8bdc-5131c0763dc0/9f8070a4-0cb5-457d-9e9c-390317a04f40,if=none,id=drive-ide0-0-0,format=qcow2,serial=6e-8bdc-5131c0763dc0,cache=none,werror=stop,rerror=stop,aio=threads -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=30 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:23:71:03,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/rhel_ha.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev spicevmc,id=charchannel1,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=com.redhat.spice.0 -usb -spice port=5904,tls-port=5905,addr=0,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=inputs -k en-us -vga qxl -global qxl-vga.vram_size=67108864
root 28755 0.0 0.0 103304 884 pts/2 S+ 14:37 0:00 grep libvirt

Versions:
libvirt-python-0.9.4-23.el6.x86_64
libvirt-client-0.9.4-23.el6.x86_64
libvirt-0.9.4-23.el6.x86_64
Kiril, from the attached logs I can see that at 8:27:47.563 your machine 'rhel_ha' died (we received EOF on the monitor). Libvirt reacts to this and, among other things, tries to restore the SELinux security labels on the disks used by the domain. That is, we try to access the NFS storage you have just cut off. AFAIK, the NFS shares are mounted with the 'soft' option, so NFS should time out after a while (2 minutes by default), unless TCP is in use, which has its own (much longer) timeouts. During this time, a process accessing the dead NFS mount is put into D state; there is no way for the process to defend against that. IIUC, after ~26 minutes libvirt became responsive again. Am I right?

On the other hand, libvirt doesn't need to restore SELinux labels on NFS; but to tell whether a file is on NFS, we have to stat() it, and I'm afraid calling stat() on a dead NFS mount will put us into D state as well.
What can be done, however, is tuning the NFS mount options: timeo and retrans; and, most of all, switching to proto=udp instead of TCP. Does this help?
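For illustration, a mount along these lines is what is meant; the server name, export path, mount point, and values below are hypothetical placeholders. timeo is in tenths of a second, and retrans is the number of retries before a soft mount gives up and returns an I/O error:

```shell
# Hypothetical example: soft mount over UDP with short timeouts so that
# I/O errors out quickly when the server disappears (run as root).
mount -t nfs -o soft,proto=udp,timeo=30,retrans=3 nfs-server:/export /rhev/data-center/mnt
```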
(In reply to comment #3)
> Kiril, from the attached logs I can see at 8:27:47.563 your machine 'rhel_ha'
> died (we've received EOF on the monitor). Libvirt reacts to this, and among
> other things, it tries to restore security selinux labels on disks used by
> domain.
> That is - we try to access that NFS you've just cut off. AFAIK, NFS are mounted
> with 'soft' argument. That is, NFS should timeout after a while (2 minutes by
> default) unless using TCP which itself has it's own (much longer) timeouts.
>
> During this, the process accessing dead NFS is put into D state; There is no
> way for process to defend that.
>
> IIUC, after ~26 minutes, libvirt became responsible again. Am I right?

Right, it became responsive.

> On the other hand, libvirt doesn't need to restore selinux labels on NFS; But
> to be able to tell if a file is on NFS, we should be able to stat() it. And I
> am afraid calling stat() on dead NFS will put us in the D state either.

After unblocking the connection, libvirt returns to the normal process state.
Reading over this, it doesn't sound like there's anything to do here on the libvirt side; we are acting in accordance with the NFS mount options. Closing as NOTABUG, but please reopen if I've missed something.