Bug 1045833
Summary:            Fail migration when VM get paused due to EIO
Product:            Red Hat Enterprise Linux 6
Component:          libvirt
Version:            6.5
Status:             CLOSED ERRATA
Severity:           high
Priority:           unspecified
Reporter:           Ilanit Stein <istein>
Assignee:           Martin Kletzander <mkletzan>
QA Contact:         Ilanit Stein <istein>
CC:                 berrange, dyuan, huding, istein, jdenemar, juzhang, juzhou, michal.skrivanek, rbalakri, weizhan, ydu, zpeng
Target Milestone:   rc
Target Release:     ---
Hardware:           Unspecified
OS:                 Unspecified
Fixed In Version:   libvirt-0.10.2-35.el6
Doc Type:           Bug Fix
Clone Of:           972675
Last Closed:        2014-10-14 04:19:27 UTC
Type:               Bug
Bug Depends On:     972675
Bug Blocks:         886416, 961154
Comment 1
Ilanit Stein
2013-12-22 12:30:32 UTC
Could you provide the relevant info in this bugzilla instead of cross-referencing to other places? What did you do, what was the result, and what log messages were generated by libvirt while you were doing that?

Background:
==========
The implementation of "Fail migration when VM get paused due to EIO" involves 3 parts:

[1] Backend bug 886416, which sends the AbortOnError=True flag. I verified this bug, since I saw in the logs that the backend indeed passes this flag.
[2] vdsm bug 961154, which passes the AbortOnError flag to libvirt. I verified this bug, as vdsm is indeed passing the flag to libvirt.
[3] This bug, which is actually the closed bug 972675, which is supposed to respond to the AbortOnError flag. I saw a problem in the response to this flag. I documented it in bug 886416, and for convenience I copy it here:

What did I do?
=============
During VM migration, I blocked storage (iptables -A OUTPUT -d <storage server> -j REJECT) on the source host.

What was the result?
===================
The VM event "VM not responding" appeared, then the VM changed from up to "?", then it was migrated (automatically), and on the destination it was in status paused.

vfeenstr inquiry:
================
"I see that this seems not to be working. When I drop all packets from the storage server on the source host side, the migration still continues; however, the VM gets paused on the source. The VM migration does not get finalized and stays stuck at about 99%, and eventually should get aborted due to our no-progress timeout, but even that does not happen. If I then allow the connection to the storage server again, we suddenly get a bunch of events and, surprisingly, even though we called abortJob, the migration succeeds and the VM ends up on the destination, and eventually happened to run again on the destination. To me it looks like libvirt got stuck because of the EIO; however, the abortOnError flag definitely does NOT work.
These are the events I was referring to:

Thread-197::DEBUG::2013-12-17 16:24:10,112::fileSD::154::Storage.StorageDomain::(__init__) Reading domain in path /rhev/data-center/mnt/10.35.64.102:_volumes_wolf_data__0__nfs__2013082717821706028/7ac99a5a-1542-4757-b7fc-c43ac2caf8f4
libvirtEventLoop::INFO::2013-12-17 16:24:10,112::vm::4366::vm.Vm::(_onAbnormalStop) vmId=`f7afb5e9-9f6a-4cca-8ef5-7d02e9a76969`::abnormal vm stop device virtio-disk0 error eio
libvirtEventLoop::INFO::2013-12-17 16:24:10,120::vm::4366::vm.Vm::(_onAbnormalStop) vmId=`f7afb5e9-9f6a-4cca-8ef5-7d02e9a76969`::abnormal vm stop device virtio-disk0 error eio
Thread-122::DEBUG::2013-12-17 16:24:10,122::fileSD::154::Storage.StorageDomain::(__init__) Reading domain in path /rhev/data-center/mnt/10.35.64.102:_volumes_wolf_export__istein__0__nfs__2013110512489493611/82d5b7d4-100b-4073-9496-cc3810c24168
libvirtEventLoop::INFO::2013-12-17 16:24:10,123::vm::4366::vm.Vm::(_onAbnormalStop) vmId=`f7afb5e9-9f6a-4cca-8ef5-7d02e9a76969`::abnormal vm stop device virtio-disk0 error eio
libvirtEventLoop::INFO::2013-12-17 16:24:10,129::vm::4366::vm.Vm::(_onAbnormalStop) vmId=`f7afb5e9-9f6a-4cca-8ef5-7d02e9a76969`::abnormal vm stop device virtio-disk0 error eio
..."

Attaching the VDSM & libvirtd logs of the EIO issues.

Created attachment 846626 [details]
VDSM & libvirtd logs of the EIO issues
In your setup, what was the setting of the NFS mount? Particularly the options retrans, timeo, and intr/nointr. How long did you wait before allowing the communication again?

According to the logs, it looks like the event about the I/O error came at the same time as the migration status info returned "completed". The ultimate question is whether to cancel such a migration or not. On one hand, there is some kind of I/O error (can it be resumed on the destination? is it safe?), but OTOH the migration itself reported completion status.

Looking at this from a different view, we disable migrations of domains paused due to I/O error, but this domain didn't even have the time to be paused due to that I/O error, because the migration was completed. What's your view on that?

(In reply to Martin Kletzander from comment #8)
> In your setup, what was the setting of the nfs mount? particularly options
> retrans, timeo and intr/nointr. How long did you wait before allowing the
> communication again?
>
> According to the logs, it looks like the event about I/O Error came at the
> same time when info about migration status returning "completed". The
> ultimate question is whether to cancel such migration or not. On one hand,
> there is some kind of I/O error (can it be resumed on destination? is it
> safe?), but OTOH the migration itself reported completion status.
>
> Looking at this from different view, we disable migrations of domains paused
> due to I/O error, but this domain didn't even have the time to be paused due
> to that I/O error, because it was completed. What's your view on that?

1. I'll check how to fetch the "nfs mount" settings you are asking for.
2. I think that in case the migration was completed, an I/O error on the source host should not have an effect.
3. The libvirt abortOnError flag response should be tested in a way that ensures the migration will not complete while this flag is sent.

Michal, would you please give your input on this?
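The retrans/timeo question above is central: on a soft NFS mount, QEMU only ever sees EIO if the kernel gives up on the RPC before the migration finishes. A rough back-of-the-envelope sketch of that give-up time, under the simplifying assumption that the client doubles the timeout after each retransmission (UDP-style backoff; real kernel behavior varies by transport and version, and per-retry timeouts are capped):

```shell
# Rough model (assumption): a soft NFS mount retries an RPC `retrans` times,
# doubling the timeout each try, then returns EIO to the application.
# timeo is in tenths of a second, as in mount -o soft,timeo=15,retrans=3 ...
timeo=15
retrans=3
total_ds=0
t=$timeo
for i in $(seq 0 "$retrans"); do   # initial try + retrans retries
  total_ds=$((total_ds + t))
  t=$((t * 2))
done
echo "worst-case wait before EIO: ~$((total_ds / 10)).$((total_ds % 10))s"
```

Under this simplistic model, timeo=15,retrans=3 gives up after roughly 22.5 seconds, while timeo=600,retrans=6 (the options visible in the mount output attached later in this bug) would take orders of magnitude longer — so with large timeouts QEMU may never receive EIO within the migration window at all.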
After discussion with Michal, we believe an I/O error received at the same time the migration completed (actually immediately before that) should still be treated as an error, due to the lack of precise information about the disk state. Since it results in a pretty trivial patch, I'd suggest taking it in.

Fixed upstream by v1.2.4-rc1-18-g837154a:

commit 837154a151ce056caba73e60c641a705136fb965
Author: Martin Kletzander <mkletzan>
Date: Thu Apr 24 16:28:36 2014 +0200

    qemu: properly quit migration with abort_on_error

I tried this on libvirt-0.10.2-35.el6.

Soft-mount NFS on both the source and target host:
#mount $nfs-server:/var/lib/libvirt/images/ on /var/lib/libvirt/migrate type nfs (rw,soft,timeo=15,retrans=3,nosharecache,vers=4,addr=$nfs_ip,clientaddr=source_ip)

Prepare a healthy guest using an image on NFS.

Set the migration speed low (so the migration takes a long time to finish):
#virsh migrate-setspeed rhel6.5 10

Do the live migration:
# virsh migrate --live rhel6.5 qemu+ssh://target_ip/system --verbose --timeout 30 --abort-on-error

During migration, block NFS using iptables:
#iptables -A OUTPUT -d $nfs_server_ip -j REJECT

Result:
Migration: [ 67 %]error: operation failed: migration job: failed due to I/O error
On the target, no guest exists.

I'm not sure this verification is enough; waiting for Ilanit's confirmation.

Hi istein,
Could you give this bug a try with the latest libvirt version on RHEL6.6? Thanks. dyuan.

The RHEL6.6 vdsm installation is now pending on missing repos. I'll update once I have this done.

Hi Ilanit,
Is there any progress on your side? All the RHEL6.6 errata should be in REL_PREP by Sep 17, so we need to verify this bug asap. Thanks.

Tested on 2 RHEL6.6 hosts (Red Hat Enterprise Linux Server release 6.6 Beta (Santiago)):
vdsm-4.16.3-3.el6ev.beta.x86_64
libvirt-0.10.2-45.el6.x86_64

Migration was not aborted. During VM migration, I blocked storage on the source by:
iptables -A OUTPUT -d <storage server> -j REJECT
Migration started at Sep 16, 09:24. After ~3 min the VM turned into "not responding" state ("?"), and then automatic migration retries followed.

In vdsm.log:
1. The abortOnError flag = 'true'.
2. These libvirt errors:

Thread-31::DEBUG::2014-09-16 09:23:30,341::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 80 edom: 20 level: 2 message: metadata not found: Requested metadata element is not present
Thread-33::DEBUG::2014-09-16 09:25:49,001::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 68 edom: 10 level: 2 message: Timed out during operation: cannot acquire state change lock

Ilanit, did QEMU report EIO to libvirt? I doubt it, since it likely got stuck in D state. Is NFS mounted in soft mode with appropriate timeouts so that QEMU can even get EIO from the kernel? In other words, if the kernel doesn't send EIO to QEMU so that it can pass it on to libvirt, there's nothing we can do.

I think the verification in comment 16 was correct, while comment 20 doesn't tell us anything useful about this bug...

Verification in comment 16 was correct. There is no detailed description, no helpful logs (showing EIO), well, nothing helpful in comment 20. I'm moving it back to ON_QA, as there is nothing to do from the DEV POV.

Created attachment 937958 [details]
src host vdsm.log for comment #20

Created attachment 937962 [details]
dst host vdsm.log for comment #20

Could you try answering the questions I asked in comment 21?

I do not know how to check these questions. Could you elaborate on how exactly I can check them? I'll be happy to do so then.

To get the options used when mounting NFS, type "mount" and look for the line describing the NFS mount point (or just attach the complete output of the mount command). To check whether EIO was sent, we need the libvirtd logs (which seem to already be attached) and the QEMU log, which can be found at /var/log/libvirt/qemu/DOMAIN_NAME.log.

The libvirt logs are missing! I am debugging it and will update later.
Here are the other inputs you've requested.

mount output on src host:
/dev/mapper/vg0-lv_root on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw,rootcontext="system_u:object_r:tmpfs_t:s0")
/dev/sda1 on /boot type ext4 (rw)
/dev/mapper/vg0-lv_home on /home type ext4 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
10.35.160.108:/RHEV/istein10 on /rhev/data-center/mnt/10.35.160.108:_RHEV_istein10 type nfs (rw,soft,nosharecache,timeo=600,retrans=6,nfsvers=3,addr=10.35.160.108)

mount output on dst host:
[root@alma02 ~]# mount
/dev/sda3 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw,rootcontext="system_u:object_r:tmpfs_t:s0")
/dev/sda1 on /boot type ext3 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
10.35.160.108:/RHEV/istein10 on /rhev/data-center/mnt/10.35.160.108:_RHEV_istein10 type nfs (rw,soft,nosharecache,timeo=600,retrans=6,nfsvers=3,addr=10.35.160.108)

QEMU log on src host (relevant part from that date):
2014-09-16 06:23:07.048+0000: starting up
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin QEMU_AUDIO_DRV=spice /usr/libexec/qemu-kvm -name rhel6 -S -M rhel6.5.0 -cpu SandyBridge -enable-kvm -m 4096 -realtime mlock=off -smp 1,maxcpus=16,sockets=16,cores=1,threads=1 -uuid 974b439c-09d9-4229-a989-692c3f7ce8c6 -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=6Server-6.5.0.1.el6,serial=4C4C4544-0059-4410-8054-B4C04F573032,uuid=974b439c-09d9-4229-a989-692c3f7ce8c6 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/rhel6.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2014-09-16T06:23:06,driftfix=slew
-no-kvm-pit-reinjection -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x5 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/00000002-0002-0002-0002-000000000047/48a3ca81-bb3e-46dd-9907-4e84eb1ba23e/images/af26f942-163e-4b0e-a8d5-e2fd06c66afd/eb0d132f-f6f9-44aa-907e-bb97f593310e,if=none,id=drive-virtio-disk0,format=raw,serial=af26f942-163e-4b0e-a8d5-e2fd06c66afd,cache=none,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=28,id=hostnet0,vhost=on,vhostfd=29 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:16:93:02,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/974b439c-09d9-4229-a989-692c3f7ce8c6.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/974b439c-09d9-4229-a989-692c3f7ce8c6.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice tls-port=5900,addr=10.35.117.20,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -k en-us -vga qxl -global qxl-vga.ram_size=67108864 -global qxl-vga.vram_size=33554432 -incoming tcp:[::]:49152 -msg timestamp=on block I/O error in device 
'drive-virtio-disk0': Input/output error (5)
2014-09-16 07:17:43.351+0000: shutting down
qemu: terminating on signal 15 from pid 118423

QEMU log on dst host (relevant part from that date):
2014-09-16 06:23:30.199+0000: shutting down
qemu: terminating on signal 15 from pid 77591
2014-09-16 06:24:50.051+0000: starting up
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin QEMU_AUDIO_DRV=spice /usr/libexec/qemu-kvm -name rhel6 -S -M rhel6.5.0 -cpu SandyBridge -enable-kvm -m 4096 -realtime mlock=off -smp 1,maxcpus=16,sockets=16,cores=1,threads=1 -uuid 974b439c-09d9-4229-a989-692c3f7ce8c6 -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=6Server-6.5.0.1.el6,serial=4C4C4544-0059-4410-8054-B4C04F573032,uuid=974b439c-09d9-4229-a989-692c3f7ce8c6 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/rhel6.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2014-09-16T06:24:49,driftfix=slew -no-kvm-pit-reinjection -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x5 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/00000002-0002-0002-0002-000000000047/48a3ca81-bb3e-46dd-9907-4e84eb1ba23e/images/af26f942-163e-4b0e-a8d5-e2fd06c66afd/eb0d132f-f6f9-44aa-907e-bb97f593310e,if=none,id=drive-virtio-disk0,format=raw,serial=af26f942-163e-4b0e-a8d5-e2fd06c66afd,cache=none,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=30,id=hostnet0,vhost=on,vhostfd=31 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:16:93:02,bus=pci.0,addr=0x3 -chardev
socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/974b439c-09d9-4229-a989-692c3f7ce8c6.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/974b439c-09d9-4229-a989-692c3f7ce8c6.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice tls-port=5900,addr=10.35.117.22,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -k en-us -vga qxl -global qxl-vga.ram_size=67108864 -global qxl-vga.vram_size=33554432 -incoming tcp:[::]:49152 -msg timestamp=on

I tried with libvirt-0.10.2-46.el6.x86_64, with the same steps as comment 16, but the result is different: the migration succeeded.

# virsh migrate --live rhel6 qemu+ssh://10.66.85.72/system --verbose --timeout 30 --abort-on-error

In another terminal:
#iptables -A OUTPUT -d $nfs_server_ip -j REJECT

root.85.72's password:
Migration: [100 %]

On the target:
# virsh list --all
 Id    Name                           State
----------------------------------------------------
 1     rhel6                          running

I will attach libvirtd.log.

Created attachment 938397 [details]
source libvirtd.log
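One detail worth noting from the source-host QEMU command line above: the drive is started with werror=stop,rerror=stop, which is how libvirt translates the disk driver attribute error_policy='stop'. Without that policy (the default is 'report'), the guest is never paused on EIO, so --abort-on-error has nothing to react to. A minimal sketch of such a disk definition — the image path and target device here are illustrative, not taken from this bug:

```xml
<!-- Sketch: pause the guest on I/O errors so abort-on-error can trigger.
     File path and target dev are hypothetical examples. -->
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' cache='none' error_policy='stop'/>
  <source file='/var/lib/libvirt/images/example.img'/>
  <target dev='vda' bus='virtio'/>
</disk>
```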
I didn't have the libvirt log, because of this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1141763

Sorry, I attached the wrong libvirtd.log; it's the target host's libvirtd.log.

Please ignore comment 30; I will re-attach the source host's libvirtd.log.

I re-tested this issue. When I add error_policy to the guest XML, the result is as expected:

....
<driver name='qemu' type='qcow2' cache='none' error_policy='stop'/>
<source file='/var/lib/libvirt/migrate/kvm-rhel6.5-x86_64-qcow2v2.img'>
  <seclabel model='selinux' relabel='no'/>
</source>
....

# virsh migrate --live rhel6 qemu+ssh://10.66.85.72/system --verbose --timeout 30 --abort-on-error
root.85.72's password:
Migration: [ 80 %]error: operation failed: migration job: failed due to I/O error

Explanation for comment 29: the error_policy default value is 'report', so the migration job continues.

Verified build: libvirt-0.10.2-46.el6.x86_64

Verification steps:
1: Prepare two hosts.
2: Soft-mount on both hosts:
mount -o soft,timeo=15,retrans=3,nosharecache $nfs_ip:/var/lib/libvirt/images /var/lib/libvirt/migrate/
3: Prepare a guest with XML:
...
<driver name='qemu' type='qcow2' cache='none' error_policy='stop'/>
<source file='/var/lib/libvirt/migrate/kvm-rhel6.5-x86_64-qcow2v2.img'>
  <seclabel model='selinux' relabel='no'/>
</source>
...
4: Do the migration:
# virsh migrate --live rhel6 qemu+ssh://10.66.85.72/system --verbose --timeout 30 --abort-on-error
During migration, use iptables to block the NFS server:
#iptables -A OUTPUT -d $nfs_server_ip -j REJECT

Migration: [ 80 %]error: operation failed: migration job: failed due to I/O error

The migration job stopped. Checking the target host, no guest exists. Moving to VERIFIED. If you have any other issues, please file a new bug.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
http://rhn.redhat.com/errata/RHBA-2014-1374.html |