Created attachment 491260 [details]
logs
Description of problem:
You cannot stop a VM that was paused due to I/O errors as long as the storage is still unavailable.
After qemu and libvirt are killed, vdsm cannot release the resource lock because it cannot access the storage, so the destroy fails.
The VMs' behaviour differs depending on whether they are running on the SPM or on an HSM.
For VMs running on an HSM:
- They immediately pause due to storage errors, and a vdsm restart cleans the locks.
For VMs running on the SPM:
- They turn to Unknown state -> try to migrate -> fail the migration; on the host they appear as Paused, while in the backend they are stuck in Migrating state.
- A vdsm restart will not remove the VMs from the host; only a complete host reboot will clean the VMs from the host, but not from the backend: you need to activate the host and then stop the VMs, which now appear as Paused in the backend, to stop them there as well.
- There is also a backend bug (695102): the host shows a VM count of 0, and trying to stop the VMs results in the error "desktop does not exist".
Version-Release number of selected component (if applicable):
ic108
vdsm-cli-4.9-58.el6.x86_64
vdsm-debug-plugin-4.9-58.el6.x86_64
vdsm-debuginfo-4.9-58.el6.x86_64
vdsm-4.9-58.el6.x86_64
qemu-img-0.12.1.2-2.152.el6.x86_64
qemu-kvm-debuginfo-0.12.1.2-2.152.el6.x86_64
gpxe-roms-qemu-0.9.7-6.4.el6.noarch
qemu-kvm-0.12.1.2-2.152.el6.x86_64
libvirt-python-0.8.7-16.el6.x86_64
libvirt-client-0.8.7-16.el6.x86_64
libvirt-0.8.7-16.el6.x86_64
libvirt-devel-0.8.7-16.el6.x86_64
libvirt-debuginfo-0.8.7-16.el6.x86_64
How reproducible:
100%
Steps to Reproduce:
1. Create an SD from an extended LV and run VMs on 2 hosts.
2. On the storage side, take one of the LUNs offline.
3. When the VMs pause due to I/O errors, try to stop a VM (a rough sketch of this flow follows below).
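For illustration only, a rough sketch of this flow in Python; the device name, VM UUID, and the wait time are placeholders, and taking the LUN offline via sysfs on the host is only a substitute for making it offline on the storage side:

#!/usr/bin/python
# Reproduction sketch only -- placeholders, not the exact steps used above.
import subprocess
import time

LUN_DEVICE = "sdX"      # placeholder: one of the SD's LUNs
VM_ID = "<vm-uuid>"     # placeholder: a UUID from "vdsClient -s 0 list table"

def run(cmd):
    print("$ %s" % " ".join(cmd))
    return subprocess.call(cmd)

# 1. Simulate the LUN disappearing (host-side substitute for the array step).
with open("/sys/block/%s/device/state" % LUN_DEVICE, "w") as f:
    f.write("offline\n")

# 2. Give the guests time to hit I/O errors and pause, then check their state.
time.sleep(60)
run(["vdsClient", "-s", "0", "list", "table"])   # VMs should show as Paused

# 3. Try to stop a paused VM; this destroy is what fails to release the lock.
run(["vdsClient", "-s", "0", "destroy", VM_ID])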
Actual results:
qemu and libvirt will be killed, but destroying the VM will fail because vdsm cannot release the resource lock.
The VM cannot be stopped, and you cannot destroy the SD because it has running VMs.
So if your storage died, you are basically unable to remove the VMs or the SD from RHEV-M or the host.
You can release the lock by restarting vdsm/the host, but:
1) Other domains (with running VMs) will also be affected, not just the problematic one.
2) A simple "stop VM" task becomes a long and very complicated action for a sysadmin (and that is only if they are knowledgeable enough in our product to solve it themselves).
Expected results:
We should be able to release the vdsm resource lock without restarting vdsm (a conceptual sketch follows below).
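A conceptual sketch of that expectation; ResourceManager, stop_vm, teardown_volume, and StorageUnavailableError are hypothetical names used for illustration, not vdsm's actual API. The point is that the in-memory lock should be released even when the storage-side teardown cannot reach the storage:

# Conceptual sketch only; these names are hypothetical, not vdsm's real API.
import logging
import threading

class StorageUnavailableError(Exception):
    """Raised by the teardown callable when the storage cannot be reached."""

class ResourceManager(object):
    """Toy stand-in for the per-VM resource locks vdsm holds in memory."""
    def __init__(self):
        self._mutex = threading.Lock()
        self._locks = {}

    def acquire(self, name):
        with self._mutex:
            lock = self._locks.setdefault(name, threading.Lock())
        lock.acquire()

    def release(self, name):
        self._locks[name].release()

def stop_vm(resources, vm_name, teardown_volume):
    """Destroy a paused VM and always free its resource lock."""
    try:
        teardown_volume(vm_name)
    except StorageUnavailableError:
        # The storage is gone anyway; log it, but do not leak the lock.
        logging.warning("teardown of %s failed, storage unavailable", vm_name)
    finally:
        # Expected behaviour: the lock is released here, without needing a
        # vdsm restart or a host reboot.
        resources.release(vm_name)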
Additional info: logs are attached.
HSM:
[root@south-01 tmp]# vdsClient -s 0 list table
c27aefde-9b80-4324-b44c-bc0769c88a74 3892 111111 Paused
60c76aec-92d6-4793-9c2a-3a52b3d9cf4b 3770 222222 Paused
[root@south-01 tmp]# virsh
Welcome to virsh, the virtualization interactive terminal.
Type: 'help' for help with commands
'quit' to quit
virsh # list
Id Name State
----------------------------------
virsh # ^C
[root@south-01 tmp]#
[root@south-01 tmp]# ps 3892
PID TTY STAT TIME COMMAND
[root@south-01 tmp]# service vdsmd restart
Shutting down vdsm daemon:
vdsm watchdog stop [ OK ]
vdsm stop [ OK ]
Restarting netconsole...
Disabling netconsole [ OK ]
Initializing netconsole [ OK ]
Starting iscsid:
Starting up vdsm daemon:
vdsm start [ OK ]
[root@south-01 tmp]# vdsClient -s 0 list table
[root@south-01 tmp]#
SPM:
[root@south-02 host_reboot]# vdsClient -s 0 list table
b6f5085c-4f31-4b68-a0a8-f5e2a445eb6c 25494 333333 Paused
af44f765-d691-4273-986f-3412a3648c80 25266 444444 Paused
[root@south-02 host_reboot]#
[root@south-02 host_reboot]#
[root@south-02 host_reboot]#
[root@south-02 host_reboot]# virsh
Welcome to virsh, the virtualization interactive terminal.
Type: 'help' for help with commands
'quit' to quit
virsh # list
Id Name State
----------------------------------
35 444444 paused
36 333333 paused
virsh # ^C
[root@south-02 host_reboot]# ps 25494
PID TTY STAT TIME COMMAND
25494 ? Sl 2:23 /usr/libexec/qemu-kvm -S -M rhel6.0.0 -cpu Opteron_G2 -enable-nesting -enable-kvm -m 512 -smp 1,sockets=1,cores=1,threads=1 -name 333333 -uuid b6f5085
[root@south-02 host_reboot]#
[root@south-02 host_reboot]# service vdsmd restart
Shutting down vdsm daemon:
vdsm watchdog stop [ OK ]
vdsm stop [ OK ]
Restarting netconsole...
Disabling netconsole [ OK ]
Initializing netconsole [ OK ]
Starting iscsid:
Starting up vdsm daemon:
vdsm start [ OK ]
[root@south-02 host_reboot]# vdsClient -s 0 list table
b6f5085c-4f31-4b68-a0a8-f5e2a445eb6c 25494 333333 Paused
af44f765-d691-4273-986f-3412a3648c80 25266 444444 Paused
[root@south-02 host_reboot]#
host reboot:
Welcome to a node of the Westford 64-node cluster.
For current system assignments see:
http://intranet.corp.redhat.com/ic/intranet/ClusterNsew.html
For other details of the cluster systems see:
https://wiki.test.redhat.com/ClusterStorage/NsewCluster
The last tree installed was RHEL6.0-20100909.1-Server
[root@south-01 ~]#
[root@south-01 ~]# vdsClient -s 0 list table
[root@south-01 ~]#
We fail to tear down a volume without accessing it. I think we should succeed.
It's not a real regression: the previous state, where you could destroy a VM but starting it up would deadlock vdsm, is much worse.
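A minimal sketch of a teardown that does not need to access the storage, assuming teardown ultimately means deactivating the image's logical volumes locally; the helper names and the direct lvchange call are illustrative, not vdsm's actual teardown path:

# Illustrative only: not vdsm's real teardown code.
import logging
import subprocess

def deactivate_lv(vg_name, lv_name):
    """Deactivate an LV locally; no data needs to be read from the LUN."""
    return subprocess.call(["lvchange", "-an", "%s/%s" % (vg_name, lv_name)])

def teardown_image(vg_name, lv_names):
    """Best-effort teardown that does not fail just because storage is gone."""
    for lv in lv_names:
        if deactivate_lv(vg_name, lv) != 0:
            # The device may already have disappeared; acceptable for teardown.
            logging.warning("could not deactivate %s/%s, continuing", vg_name, lv)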
Dafna, why is this a test blocker?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
http://rhn.redhat.com/errata/RHEA-2011-1782.html