Bug 866369

Summary: libvirt: terminating vm on signal 15 when hibernate fails on ENOSPACE
Product: Red Hat Enterprise Linux 6
Component: libvirt
Version: 6.3
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Dafna Ron <dron>
Assignee: Michal Privoznik <mprivozn>
QA Contact: Virtualization Bugs <virt-bugs>
Docs Contact:
CC: acathrow, cpelland, dallan, dyasny, dyuan, hateya, jdenemar, mburns, mzhan, rwu, whuang
Target Milestone: rc
Keywords: ZStream
Target Release: ---
Whiteboard:
Fixed In Version: libvirt-0.10.2-8.el6
Doc Type: Bug Fix
Doc Text:
Under certain circumstances, the iohelper process failed to write data to disk while saving a domain, and the kernel did not report an out-of-space error (ENOSPC). With this update, libvirt calls the fdatasync() function in the described scenario to force the data to be written to disk or to catch a write error. As a result, if a write error occurs, it is now properly caught and reported.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-02-21 07:09:55 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 873292
Attachments (Description / Flags):
  logs / none
  logs / none

Description Dafna Ron 2012-10-15 08:30:28 UTC
Created attachment 627239 [details]
logs

Description of problem:

I hibernated a vm through rhevm and there was not enough space on the storage domain.
The vm paused and was then terminated on signal 15.

Version-Release number of selected component (if applicable):

libvirt-0.9.10-21.el6_3.5.x86_64
vdsm-4.9.6-37.0.el6_3.x86_64
qemu-img-rhev-0.12.1.2-2.295.el6_3.2.x86_64
qemu-kvm-rhev-0.12.1.2-2.295.el6_3.2.x86_64

How reproducible:

100%

Steps to Reproduce:
1. create a small storage domain and try to hibernate a vm
  
Actual results:

The hibernate fails on ENOSPC, the vm becomes paused, and qemu/libvirt then terminate the vm on signal 15.

Expected results:

The vm should pause but not be killed.
Additional info:

Comment 1 Huang Wenlong 2012-10-16 09:44:10 UTC
Hi, Dafna 

I cannot reproduce this bug. I installed a vm and then filled up the storage pool,
then ran pm-hibernate in the guest.
The guest is paused in rhevm with the error:
"VM bz369 has paused due to Storage space error"
Checking with ps | grep qemu, the qemu process is still there.


host:
libvirt-0.9.10-21.el6_3.5.x86_64
qemu-kvm-rhev-0.12.1.2-2.295.el6_3.2.x86_64
vdsm-4.9.6-38.0.el6_3.x86_64




Wenlong

Comment 2 Dafna Ron 2012-10-16 09:54:48 UTC
Are you hibernating the vm?
The vm has enough space to run, but when you hibernate it, it performs a migrate-to-file, which fails on ENOSPC and causes qemu to crash.

Also, are you using nfs or block storage? It happened to me with iscsi storage.

Comment 3 Huang Wenlong 2012-10-16 10:27:48 UTC
(In reply to comment #2)
> are you hibernating the vm? 
> the vm has enough space to run but when you hibernate it performs migrate to
> file which fails on ENOSPACE that causes qemu to crash. 
> 
> also, are you using nfs or block? it happened to me with iscsi storage.

I ran pm-hibernate in the vm console; I cannot find a "hibernate" button in the RHEVM. Which command or button do you use?
I use NFS storage.

Comment 4 Dafna Ron 2012-10-16 10:32:16 UTC
The vm console will not hibernate the vm.
You need rhevm -> Virtual Machines tab -> select the vm -> right click -> suspend.

Comment 5 Huang Wenlong 2012-10-18 02:44:11 UTC
Hi, Dafna

I tried it with an iSCSI storage domain and rhevm -> suspend on a guest while the storage domain had low disk space, but I still cannot reproduce this bug.
Could you verify this bug once it is fixed? Thanks very much.

libvirt-0.9.10-21.el6_3.5.x86_64
qemu-kvm-rhev-0.12.1.2-2.295.el6_3.2.x86_64
vdsm-4.9.6-38.0.el6_3.x86_64


Wenlong

Comment 6 Dafna Ron 2012-10-18 08:57:14 UTC
I can verify this bug but since I work on a downstream system I would need an rpm to verify.

Comment 7 Dafna Ron 2012-10-18 10:11:20 UTC
I managed to narrow this down, and it might be a race.
In my setup I have other storage domains with other vms running on them:
run 6 vms on another storage domain and 1 vm on the small storage domain.
All vms have 15GB thin-provisioned disks.
Suspend all vms one at a time, leaving the vm on the small domain for last.
It still reproduces on my setup, so I can verify for you, but this info might be helpful.

Comment 8 Min Zhan 2012-10-18 10:27:48 UTC
(In reply to comment #7)

Thanks, Dafna, for the details and for offering to verify!

We will try again following your comment 7 to check whether we can reproduce it. If we still cannot, we will need your help to verify once the package that fixes this bug is available.

Comment 9 Michal Privoznik 2012-10-22 10:31:19 UTC
Dafna,

On which VM from the attached logs did you encounter this bug?
BTW: I cannot see anything wrong so far. I mean, qemu reports 'migration completed' (a save to file is just a migration into that file), so libvirt kills the process.

Comment 10 Dafna Ron 2012-10-22 11:57:40 UTC
(In reply to comment #9)
> Dafna,
> 
> on which VM from attached log did you encounter this bug?

RHEL6-01 is the vm (you have the vm log also in the attachment). 


> BTW: I cannot see anything wrong so far. I mean qemu reports 'migration
> completed' (save to file is just a migration into that file) so libvirt
> kills the process.

In that case I have a theory :)

Is it possible that the issue is that there are two different flows here?
Hibernate: migrate to file -> kill pid.
ENOSPC on a running vm: the vm is moved to paused.

If the vm hits ENOSPC during the migrate and is moved to the paused state instead of suspended, then once the migration completes the vm is killed, because its status was changed to paused rather than suspended.

Comment 11 Michal Privoznik 2012-10-22 14:58:29 UTC
This is not a case of live migration. The guest is paused just before the migration, not the other way around (that is, it did not pause because of an I/O error).

Anyway, I cannot confirm your theory since it is all done within qemu. But reading the logs I see this:


2012-10-14 15:55:14.040+0000: 9552: debug : qemuMonitorSend:823 : QEMU_MONITOR_SEND_MSG: mon=0x7f3ff4008e70 msg={"execute":"query-migrate","id":"libvirt-289"}
 fd=-1
2012-10-14 15:55:14.041+0000: 9546: debug : qemuMonitorRef:201 : QEMU_MONITOR_REF: mon=0x7f3ff4008e70 refs=4
2012-10-14 15:55:14.041+0000: 9546: debug : qemuMonitorIOWrite:432 : QEMU_MONITOR_IO_WRITE: mon=0x7f3ff4008e70 buf={"execute":"query-migrate","id":"libvirt-289"}
 len=48 ret=48 errno=11
2012-10-14 15:55:14.041+0000: 9546: debug : qemuMonitorUnref:210 : QEMU_MONITOR_UNREF: mon=0x7f3ff4008e70 refs=3
2012-10-14 15:55:16.366+0000: 9546: debug : qemuMonitorRef:201 : QEMU_MONITOR_REF: mon=0x7f3ff4008e70 refs=4
2012-10-14 15:55:16.366+0000: 9546: debug : qemuMonitorIOProcess:327 : QEMU_MONITOR_IO_PROCESS: mon=0x7f3ff4008e70 buf={"return": {"status": "completed"}, "id": "libvirt-289"}
 len=58
[...]
2012-10-14 15:55:20.570+0000: 9552: debug : qemuProcessStop:3763 : Shutting down VM 'RHEL6-01' pid=4188 migrated=0

where 9552 is the ID of the thread executing virDomainSave to the iSCSI target. It can be seen that qemu reports 'completed', to which libvirt reacts by killing the domain. However, qemu shouldn't have reported success in the first place. On the other hand, the actual write of the data is done by libvirt (the iohelper, to be precise); QEMU is just handed an FD into which it pokes the data. But for some reason we don't read the iohelper's stderr or report it anywhere. When I wrote a small patch, I got a report from the iohelper like '/usr/libexec/libvirt_iohelper: Unable to write <path>: Input/output error'.
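For illustration only, here is a minimal sketch of an iohelper-style program of the kind described above (this is not the real libvirt_iohelper source; the argument handling and buffer size are made up): it copies data from stdin into a destination file and prints any write failure to stderr, which the parent process would then need to capture and include in its own error report rather than discard.

/* Hypothetical sketch of an iohelper-style program: copy stdin to a file
 * and report write errors on stderr so the parent process can relay them.
 * Not the real libvirt_iohelper source. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <path>\n", argv[0]);
        return EXIT_FAILURE;
    }

    int fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) {
        fprintf(stderr, "Unable to open %s: %s\n", argv[1], strerror(errno));
        return EXIT_FAILURE;
    }

    char buf[64 * 1024];
    ssize_t nread;
    while ((nread = read(STDIN_FILENO, buf, sizeof(buf))) > 0) {
        ssize_t off = 0;
        while (off < nread) {
            ssize_t nwritten = write(fd, buf + off, nread - off);
            if (nwritten < 0) {
                /* If this message never reaches the caller, the save can
                 * appear to succeed even though the data was lost. */
                fprintf(stderr, "Unable to write %s: %s\n",
                        argv[1], strerror(errno));
                close(fd);
                return EXIT_FAILURE;
            }
            off += nwritten;
        }
    }
    close(fd);
    return EXIT_SUCCESS;
}

The small patch mentioned above concerns the parent side of this: read what the helper prints on stderr and include it in libvirt's error report instead of dropping it.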

So I think we should do the following: clone this bug to qemu-kvm so they can report the status correctly when the migration FD is closed, and leave this one open so I can post the patch I mention a few lines above.

Comment 12 Huang Wenlong 2012-10-23 02:33:18 UTC
(In reply to comment #7)
> I managed to fine-tune this and it might be a race.  
> in my setup I have other storage domains with other vms running on them. 
> run 6 vm's on other storage domain and 1 vm on the small storage domain. 
> all vm's are 15GB disk thin provision.
> suspend all vm's one at a time when the last one is the vm on the small
> domain. 
> it still reproduces on my setup so I can verify for you but this info might
> be helpful.

In my env I still cannot reproduce this bug; rhevm reports a "No space" error before the vm tries to suspend.

Comment 13 Michal Privoznik 2012-10-23 13:00:07 UTC
Patch proposed upstream:

https://www.redhat.com/archives/libvir-list/2012-October/msg01257.html

Comment 14 Michal Privoznik 2012-10-24 15:50:25 UTC
Another try:

https://www.redhat.com/archives/libvir-list/2012-October/msg01436.html

Comment 15 Michal Privoznik 2012-10-25 14:32:23 UTC
Dafna,

I've been thinking a lot about this one lately, and I think I might have found one corner case where there might be a bug in libvirt. However, I've tried to reproduce it without any luck; every time, the domain wasn't killed. So if you can reproduce it and attach full debug logs (that is, without any log_filters applied in libvirtd.conf, and attach /var/log/libvirt/qemu/$domain.log as well), that would be great. Thanks in advance.

Comment 17 Michal Privoznik 2012-10-29 13:57:27 UTC
Dafna,

I've spun a scratch build for you which should log all of the iohelper's error messages (the iohelper is a small program within the libvirt package that writes the qemu migration data onto disk). You still need to remove log_filters, though.

https://brewweb.devel.redhat.com/taskinfo?taskID=5026359

Comment 20 Michal Privoznik 2012-10-30 18:57:21 UTC
Dafna,

So I may have found something. Long story short, write() doesn't ensure that data is actually written to the disk (NFS share, iSCSI target, whatever). So the libvirt iohelper must call fsync() at the end to force a flush of the OS caches and hence catch ENOSPC. However, this may hurt performance, since fsync() blocks until the data is definitely written.
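As a rough illustration of the point above (a sketch assuming a plain POSIX file descriptor, not the actual iohelper patch): write() can return success while the data is still sitting in the page cache, so on an NFS share or iSCSI target an out-of-space condition may only surface when that cache is flushed. Calling fsync() before closing forces the flush, turning the deferred failure into an error the writer can catch and report.

/* Minimal sketch: why a final fsync() is needed to catch a deferred ENOSPC.
 * write() may only place data into the page cache and report success;
 * the real I/O error can show up later, when the kernel flushes the cache. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int save_and_flush(int fd, const char *buf, size_t len)
{
    size_t off = 0;
    while (off < len) {
        ssize_t n = write(fd, buf + off, len - off);
        if (n < 0) {
            fprintf(stderr, "write failed: %s\n", strerror(errno));
            return -1;
        }
        off += (size_t)n;
    }

    /* Force cached data out to the backing store (NFS share, iSCSI target,
     * whatever). If the device is full, the error surfaces here instead of
     * being dropped after the process has already reported success. */
    if (fsync(fd) < 0) {
        fprintf(stderr, "fsync failed: %s\n", strerror(errno)); /* e.g. ENOSPC */
        return -1;
    }
    return 0;
}

fdatasync(), which the final fix uses according to the Doc Text above, serves the same purpose here but skips flushing file metadata, so it is somewhat cheaper.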

So I've created another scratch build. Let's forget the previous one and try this one. It contains all the patches from the previous one plus one more iohelper patch that calls fsync() as described above. So please, can you give it a try and see whether you still experience the reported behaviour? Also, do you notice any performance drop?

https://brewweb.devel.redhat.com/taskinfo?taskID=5030495

Thanks a lot!

Comment 22 Dafna Ron 2012-10-31 15:39:26 UTC
It seems that the patch works.
The vm did not pause and the migration was successful.

The migrate seems to have taken a bit longer, but not significantly.

I will upload the logs.

Comment 23 Dafna Ron 2012-10-31 15:40:05 UTC
Created attachment 636177 [details]
logs

Comment 27 Michal Privoznik 2012-11-05 16:23:24 UTC
We need this patch as well:

http://post-office.corp.redhat.com/archives/rhvirt-patches/2012-November/msg00057.html

Comment 31 Dafna Ron 2012-11-22 15:37:42 UTC
Verified on libvirt-0.10.2-8.el6.

Comment 32 errata-xmlrpc 2013-02-21 07:09:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0276.html