Created attachment 491260 [details]
logs

Description of problem:
You cannot stop a VM that was paused due to I/O errors as long as the storage is still unavailable. After qemu and libvirt are killed, vdsm cannot release the resource lock since it cannot access the storage, and the destroy fails.

The VMs' behaviour differs depending on whether they are located on the SPM or the HSM.

For VMs running on the HSM:
- They immediately pause due to storage errors, and a vdsm restart cleans the locks.

For VMs running on the SPM:
- They turn to Unknown state -> try to migrate -> fail migration -> on the host they appear as Paused; in the backend they are stuck in Migrating state.
- A vdsm restart will not remove the VMs from the host; only a complete host reboot will clean the VMs from the host, but not from the backend: you need to activate the host and then stop the VMs, which now appear as Paused in the backend, to stop them in the backend.
- There is also a backend bug, bug 695102: the host shows a VM count of 0, and trying to stop the VMs results in the error "desktop does not exist".

Version-Release number of selected component (if applicable):
ic108
vdsm-cli-4.9-58.el6.x86_64
vdsm-debug-plugin-4.9-58.el6.x86_64
vdsm-debuginfo-4.9-58.el6.x86_64
vdsm-4.9-58.el6.x86_64
qemu-img-0.12.1.2-2.152.el6.x86_64
qemu-kvm-debuginfo-0.12.1.2-2.152.el6.x86_64
gpxe-roms-qemu-0.9.7-6.4.el6.noarch
qemu-kvm-0.12.1.2-2.152.el6.x86_64
libvirt-python-0.8.7-16.el6.x86_64
libvirt-client-0.8.7-16.el6.x86_64
libvirt-0.8.7-16.el6.x86_64
libvirt-devel-0.8.7-16.el6.x86_64
libvirt-debuginfo-0.8.7-16.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an SD from an extended LV and run VMs on 2 hosts.
2. On the storage side, take one of the LUNs offline.
3. When the VMs pause due to I/O errors, try to stop a VM.

Actual results:
qemu and libvirt will be killed, but destroying the VM will fail because vdsm cannot release the resource lock.
The VM cannot be stopped, and you cannot destroy the SD because it has running VMs. So if your storage died, you are basically unable to remove the VMs or the SD from RHEV-M and the host.

You can release the lock by restarting vdsm or the host, but:
1) Other domains (with running VMs) will also be affected, not just the problematic one.
2) A simple "stop VM" task becomes a long and very complicated action for a sysadmin (and that is assuming they are knowledgeable enough in our product to solve it themselves).

Expected results:
We should be able to release the vdsm resource lock without restarting vdsm.

Additional info:
Logs are attached.

HSM:

[root@south-01 tmp]# vdsClient -s 0 list table
c27aefde-9b80-4324-b44c-bc0769c88a74   3892  111111  Paused
60c76aec-92d6-4793-9c2a-3a52b3d9cf4b   3770  222222  Paused
[root@south-01 tmp]# virsh
Welcome to virsh, the virtualization interactive terminal.

Type:  'help' for help with commands
       'quit' to quit

virsh # list
 Id Name                 State
----------------------------------

virsh # ^C
[root@south-01 tmp]# ps 3892
  PID TTY      STAT   TIME COMMAND
[root@south-01 tmp]# service vdsmd restart
Shutting down vdsm daemon:
vdsm watchdog stop                                         [  OK  ]
vdsm stop                                                  [  OK  ]
Restarting netconsole...
Disabling netconsole                                       [  OK  ]
Initializing netconsole                                    [  OK  ]
Starting iscsid:
Starting up vdsm daemon:
vdsm start                                                 [  OK  ]
[root@south-01 tmp]# vdsClient -s 0 list table
[root@south-01 tmp]#

SPM:

[root@south-02 host_reboot]# vdsClient -s 0 list table
b6f5085c-4f31-4b68-a0a8-f5e2a445eb6c  25494  333333  Paused
af44f765-d691-4273-986f-3412a3648c80  25266  444444  Paused
[root@south-02 host_reboot]# virsh
Welcome to virsh, the virtualization interactive terminal.

Type:  'help' for help with commands
       'quit' to quit

virsh # list
 Id Name                 State
----------------------------------
 35 444444               paused
 36 333333               paused

virsh # ^C
[root@south-02 host_reboot]# ps 25494
  PID TTY      STAT   TIME COMMAND
25494 ?        Sl     2:23 /usr/libexec/qemu-kvm -S -M rhel6.0.0 -cpu Opteron_G2 -enable-nesting -enable-kvm -m 512 -smp 1,sockets=1,cores=1,threads=1 -name 333333 -uuid b6f5085
[root@south-02 host_reboot]# service vdsmd restart
Shutting down vdsm daemon:
vdsm watchdog stop                                         [  OK  ]
vdsm stop                                                  [  OK  ]
Restarting netconsole...
Disabling netconsole                                       [  OK  ]
Initializing netconsole                                    [  OK  ]
Starting iscsid:
Starting up vdsm daemon:
vdsm start                                                 [  OK  ]
[root@south-02 host_reboot]# vdsClient -s 0 list table
b6f5085c-4f31-4b68-a0a8-f5e2a445eb6c  25494  333333  Paused
af44f765-d691-4273-986f-3412a3648c80  25266  444444  Paused
[root@south-02 host_reboot]#

Host reboot:

Welcome to a node of the Westford 64-node cluster.
For current system assignments see:
  http://intranet.corp.redhat.com/ic/intranet/ClusterNsew.html
For other details of the cluster systems see:
  https://wiki.test.redhat.com/ClusterStorage/NsewCluster
The last tree installed was RHEL6.0-20100909.1-Server

[root@south-01 ~]# vdsClient -s 0 list table
[root@south-01 ~]#
We fail to tear down a volume without accessing it. I think we should succeed. It's not a real regression: the previous state, where you could destroy a VM but starting it up would deadlock vdsm, was much worse. Dafna, why is this a test blocker?
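The behaviour asked for here, releasing the resource lock even when the storage-side teardown cannot complete, comes down to a try/finally pattern. Below is a minimal sketch of the idea only; it is not vdsm's actual code, and all names (ResourceLock, StorageUnavailable, teardown_volume) are hypothetical:

```python
import threading


class StorageUnavailable(Exception):
    """Raised when the LUN backing a volume cannot be reached."""


class ResourceLock:
    """Hypothetical stand-in for vdsm's per-volume resource lock."""

    def __init__(self):
        self._lock = threading.Lock()

    def acquire(self):
        self._lock.acquire()

    def release(self):
        self._lock.release()


def teardown_volume(lock, deactivate):
    """Tear down a volume, releasing the lock even if the storage
    is unreachable (the failure mode described in this bug)."""
    try:
        deactivate()           # may raise if the LUN is offline
    except StorageUnavailable:
        pass                   # best-effort: the storage is gone anyway
    finally:
        lock.release()         # never leave the lock held
```

The point is the finally clause: the lock is dropped whether or not the LUN is reachable, so destroying a VM after storage loss does not leave vdsm holding a lock that only a restart can clear.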
http://gerrit.usersys.redhat.com/#change,357
The fix worked well on a one-host cluster, but a two-host cluster cannot be checked because of bug 706042. Blocked until bug 706042 is fixed.
Verified on ic127 with vdsm-4.9-75.el6.x86_64.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2011-1782.html