Bug 866369
Summary: libvirt: terminating vm on signal 15 when hibernate fails on ENOSPACE

| Field | Value | Field | Value |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Dafna Ron <dron> |
| Component: | libvirt | Assignee: | Michal Privoznik <mprivozn> |
| Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | high | Priority: | high |
| Version: | 6.3 | Target Milestone: | rc |
| Hardware: | x86_64 | OS: | Linux |
| Keywords: | ZStream | Fixed In Version: | libvirt-0.10.2-8.el6 |
| Doc Type: | Bug Fix | Type: | Bug |
| Last Closed: | 2013-02-21 07:09:55 UTC | Bug Blocks: | 873292 |
| CC: | acathrow, cpelland, dallan, dyasny, dyuan, hateya, jdenemar, mburns, mzhan, rwu, whuang | | |

Doc Text:

> Under certain circumstances, the iohelper process failed to write data to disk while saving a domain, and the kernel did not report an out-of-space error (ENOSPC). With this update, libvirt calls the fdatasync() function in the described scenario to force the data to be written to disk or catch a write error. As a result, if a write error occurs, it is now properly caught and reported.
Hi Dafna,

I can not reproduce this bug. I installed a vm, then crammed the storage pool, then ran pm-hibernate in the guest. The guest is paused in the rhevm with the error: "VM bz369 has paused due to Storage space error". Checking `ps | grep qemu`, the qemu process is still there.

host:
libvirt-0.9.10-21.el6_3.5.x86_64
qemu-kvm-rhev-0.12.1.2-2.295.el6_3.2.x86_64
vdsm-4.9.6-38.0.el6_3.x86_64

Wenlong

---

are you hibernating the vm? the vm has enough space to run, but when you hibernate it performs a migrate to file, which fails on ENOSPACE; that causes qemu to crash.

also, are you using nfs or block? it happened to me with iscsi storage.

---

(In reply to comment #2)
> are you hibernating the vm?
> the vm has enough space to run but when you hibernate it performs migrate to
> file which fails on ENOSPACE that causes qemu to crash.
>
> also, are you using nfs or block? it happened to me with iscsi storage.

I run pm-hibernate in the vm console; I do not find a "hibernate" button in the RHEVM. Which cmd or button do you use? I use NFS storage.

---

vm console will not hibernate the vm. you need rhevm -> Virtual Machines tab -> select the vm -> right click -> suspend

---

Hi Dafna,

I tried it with an iSCSI storage domain and rhevm->suspend guest when the storage domain is low on disk space, but I still can not reproduce this bug. Could you verify this bug once it is fixed? Thanks very much.

libvirt-0.9.10-21.el6_3.5.x86_64
qemu-kvm-rhev-0.12.1.2-2.295.el6_3.2.x86_64
vdsm-4.9.6-38.0.el6_3.x86_64

Wenlong

---

I can verify this bug, but since I work on a downstream system I would need an rpm to verify.

I managed to fine-tune this and it might be a race. In my setup I have other storage domains with other vms running on them: run 6 vm's on another storage domain and 1 vm on the small storage domain. All vm's have 15GB thin-provisioned disks. Suspend all vm's one at a time, with the last one being the vm on the small domain.

it still reproduces on my setup, so I can verify for you, but this info might be helpful.
(In reply to comment #7)

Thanks Dafna for the details and for promising verification! We will try again with your latest comment 7 to check whether we can reproduce it. If still not, we will need your help to verify once the package fixing this bug comes out.

---

Dafna,

on which VM from the attached log did you encounter this bug?

BTW: I cannot see anything wrong so far. I mean, qemu reports 'migration completed' (save to file is just a migration into that file), so libvirt kills the process.

---

(In reply to comment #9)
> Dafna,
>
> on which VM from attached log did you encounter this bug?

RHEL6-01 is the vm (you have the vm log also in the attachment).

> BTW: I cannot see anything wrong so far. I mean qemu reports 'migration
> completed' (save to file is just a migration into that file) so libvirt
> kills the process.

in that case I have a theory :) Is it possible that there are two different flows here?

hibernate should: migrate to file -> kill pid
if we have ENOSPACE on a vm, it is moved to paused

if during the migrate the vm gets ENOSPACE and is moved to status paused instead of suspended, then when migration is complete, since the vm status was changed to paused and not to suspended, the vm is killed.

---

This is not a case of live migration. The guest is paused just before migration, not vice versa (i.e. not paused due to an I/O error). Anyway, I cannot confirm your theory since it's all done within qemu.
But, reading the logs I see this:

```
2012-10-14 15:55:14.040+0000: 9552: debug : qemuMonitorSend:823 : QEMU_MONITOR_SEND_MSG: mon=0x7f3ff4008e70 msg={"execute":"query-migrate","id":"libvirt-289"} fd=-1
2012-10-14 15:55:14.041+0000: 9546: debug : qemuMonitorRef:201 : QEMU_MONITOR_REF: mon=0x7f3ff4008e70 refs=4
2012-10-14 15:55:14.041+0000: 9546: debug : qemuMonitorIOWrite:432 : QEMU_MONITOR_IO_WRITE: mon=0x7f3ff4008e70 buf={"execute":"query-migrate","id":"libvirt-289"} len=48 ret=48 errno=11
2012-10-14 15:55:14.041+0000: 9546: debug : qemuMonitorUnref:210 : QEMU_MONITOR_UNREF: mon=0x7f3ff4008e70 refs=3
2012-10-14 15:55:16.366+0000: 9546: debug : qemuMonitorRef:201 : QEMU_MONITOR_REF: mon=0x7f3ff4008e70 refs=4
2012-10-14 15:55:16.366+0000: 9546: debug : qemuMonitorIOProcess:327 : QEMU_MONITOR_IO_PROCESS: mon=0x7f3ff4008e70 buf={"return": {"status": "completed"}, "id": "libvirt-289"} len=58
[...]
2012-10-14 15:55:20.570+0000: 9552: debug : qemuProcessStop:3763 : Shutting down VM 'RHEL6-01' pid=4188 migrated=0
```

where 9552 is the ID of the thread executing virDomainSave to the iSCSI target. It can be seen that qemu reports 'completed', to which libvirt reacts by killing the domain. However, qemu shouldn't have reported success in the first place. On the other hand, the actual write of the data is done by libvirt (the iohelper, to be precise); QEMU is just passed an FD into which it pokes the data. But for some reason, we don't read the stderr of iohelper nor report it anywhere. When I wrote a small patch, I got a report from iohelper like '/usr/libexec/libvirt_iohelper: Unable to write <path>: Input/output error'.

So I think we should do the following: clone this bug to qemu-kvm so they can report status correctly when the migration FD is closed, and leave this one open so I can post the patch I mentioned a few lines above.

---

(In reply to comment #7)
> I managed to fine-tune this and it might be a race.
> in my setup I have other storage domains with other vms running on them.
> run 6 vm's on other storage domain and 1 vm on the small storage domain.
> all vm's are 15GB disk thin provision.
> suspend all vm's one at a time when the last one is the vm on the small
> domain.
> it still reproduces on my setup so I can verify for you but this info might
> be helpful.

In my env I still can not reproduce this bug; rhevm reports a No space error before the vm tries to suspend.

---

Patch proposed upstream: https://www.redhat.com/archives/libvir-list/2012-October/msg01257.html

---

Dafna, I've been thinking a lot about this one lately, and I think I might have found one corner case where there might be a bug in libvirt. However, I've tried to reproduce it without any luck; every time, the domain wasn't killed. So if you can reproduce it and attach full debug logs (that is, without any log_filters applied in libvirtd.conf, and attach /var/log/libvirt/qemu/$domain.log as well), that would be great. Thanks in advance.

---

Dafna, I've spun a scratch build for you which should log all of iohelper's error messages (iohelper is a small program within the libvirt package that writes the qemu migration data onto disk). You still need to remove log_filters though.

https://brewweb.devel.redhat.com/taskinfo?taskID=5026359

---

Dafna, so I may have found something. Long story short: write() doesn't ensure the data is actually written onto disk (NFS share, iSCSI target, whatever). So the libvirt iohelper must call fsync() at the end to force a flush of the OS caches and hence catch ENOSPACE. However, this may hurt performance, since fsync() blocks until the data is written for sure.

So I've created another scratch build. Let's forget the previous one and try this one. It contains all the patches from the previous one, plus one iohelper patch to call fsync() as described above. So please, can you give it a try and see if you experience the reported behaviour? Moreover, do you notice any performance drops?

https://brewweb.devel.redhat.com/taskinfo?taskID=5030495

Thanks a lot!

---

it seems that the patch works.
the vm did not pause and migration was successful. It seems that the migration took a bit longer, but not significantly. I will upload the logs.

Created attachment 636177 [details]
logs
Moving to POST:

http://post-office.corp.redhat.com/archives/rhvirt-patches/2012-November/msg00012.html
http://post-office.corp.redhat.com/archives/rhvirt-patches/2012-November/msg00014.html
http://post-office.corp.redhat.com/archives/rhvirt-patches/2012-November/msg00015.html

---

We need this patch as well:

http://post-office.corp.redhat.com/archives/rhvirt-patches/2012-November/msg00057.html

---

verified on libvirt-0.10.2-8.el6

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0276.html
Created attachment 627239 [details]
logs

Description of problem:
I hibernated a vm through rhevm and there was not enough space on the storage domain. The vm pauses and is then terminated on signal 15.

Version-Release number of selected component (if applicable):
libvirt-0.9.10-21.el6_3.5.x86_64
vdsm-4.9.6-37.0.el6_3.x86_64
qemu-img-rhev-0.12.1.2-2.295.el6_3.2.x86_64
qemu-kvm-rhev-0.12.1.2-2.295.el6_3.2.x86_64

How reproducible:
100%

Steps to Reproduce:
1. create a small storage domain and try to hibernate a vm

Actual results:
the hibernate fails on ENOSPACE, the vm becomes paused, and qemu/libvirt terminates the vm on signal 15

Expected results:
vm should pause but not be killed

Additional info: