Bug 695393

Summary: VDSM: Cannot stop VM that was paused due to storage I/O errors as long as the storage remains unavailable
Product: Red Hat Enterprise Linux 6
Component: vdsm
Version: 6.1
Status: CLOSED ERRATA
Severity: urgent
Priority: high
Hardware: x86_64
OS: Linux
Target Milestone: rc
Keywords: TestBlocker
Whiteboard: storage
Fixed In Version: vdsm-4.9-66.el6
Doc Type: Bug Fix
Reporter: Dafna Ron <dron>
Assignee: Igor Lvovsky <ilvovsky>
QA Contact: Dafna Ron <dron>
CC: abaron, bazulay, iheim, lpeer, mkenneth, pstehlik, yeylon, ykaul
Last Closed: 2011-12-06 07:14:18 UTC
Bug Depends On: 695102, 706042
Attachments: logs

Description Dafna Ron 2011-04-11 15:27:00 UTC
Created attachment 491260 [details]
logs

Description of problem:

A VM that was paused due to storage I/O errors cannot be stopped as long as the storage remains unavailable.
After qemu and libvirt are killed, vdsm cannot release the resource lock, since it cannot access the storage, and the destroy fails.
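
For illustration, the failing sequence on the host looks roughly like this (a sketch: <vmId> stands for the VM's UUID, and destroy is the vdsClient verb behind "stop VM"):

vdsClient -s 0 list table        # the VM is still listed as Paused
vdsClient -s 0 destroy <vmId>    # fails, because vdsm cannot release the resource lock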

The VMs' behaviour differs depending on whether they run on the SPM or on an HSM host.

For VMs running on an HSM host:
- they pause immediately due to storage errors, and a vdsm restart cleans the locks.

For VMs running on the SPM:
- they move to Unknown state -> try to migrate -> fail the migration -> on the host they appear as Paused, while in the backend they are stuck in Migrating state.
- a vdsm restart will not remove the VMs from the host; only a complete host reboot cleans the VMs from the host, but not from the backend: you need to activate the host and then stop the VMs, which now appear as Paused in the backend, to stop them in the backend.
- there is also a backend bug, bug 695102: the host shows a VM count of 0, and trying to stop the VMs results in the error "desktop does not exist".


Version-Release number of selected component (if applicable):
ic108
vdsm-cli-4.9-58.el6.x86_64
vdsm-debug-plugin-4.9-58.el6.x86_64
vdsm-debuginfo-4.9-58.el6.x86_64
vdsm-4.9-58.el6.x86_64
qemu-img-0.12.1.2-2.152.el6.x86_64
qemu-kvm-debuginfo-0.12.1.2-2.152.el6.x86_64
gpxe-roms-qemu-0.9.7-6.4.el6.noarch
qemu-kvm-0.12.1.2-2.152.el6.x86_64
libvirt-python-0.8.7-16.el6.x86_64
libvirt-client-0.8.7-16.el6.x86_64
libvirt-0.8.7-16.el6.x86_64
libvirt-devel-0.8.7-16.el6.x86_64
libvirt-debuginfo-0.8.7-16.el6.x86_64


How reproducible:

100%

Steps to Reproduce:
1. create an SD from an extended LV and run VMs on 2 hosts
2. on the storage side, take one of the LUNs offline (see the sketch below)
3. when the VMs pause due to I/O errors, try to stop a VM
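
One way to simulate step 2 from the host itself (an assumption; the original reproduction took the LUN offline on the storage side) is to mark the SCSI device backing the LUN as offline:

echo offline > /sys/block/sdc/device/state   # assumption: /dev/sdc backs the affected LUN
echo running > /sys/block/sdc/device/state   # restores the device afterwards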
  
Actual results:

qemu and libvirt are killed, but destroying the VM fails because vdsm cannot release the resource lock.
The VM cannot be stopped, and you cannot destroy the SD because it still has running VMs.
So if your storage died, you are basically unable to remove the VMs or the SD from RHEV-M or from the host.
You can release the lock by restarting vdsm or the host, but:

1) this means that other domains (with running VMs) will also be affected, not just the problematic one.
 
2) a simple "stop VM" task becomes a long and very complicated procedure for a sysadmin (and that is assuming they are knowledgeable enough in our product to solve it themselves).

Expected results:

we should be able to release the vdsm resource lock without restarting vdsm

Additional info: logs are attached.

HSM: 

[root@south-01 tmp]# vdsClient -s 0 list table
c27aefde-9b80-4324-b44c-bc0769c88a74   3892  111111               Paused                                   
60c76aec-92d6-4793-9c2a-3a52b3d9cf4b   3770  222222               Paused                                   
[root@south-01 tmp]# virsh 
Welcome to virsh, the virtualization interactive terminal.

Type:  'help' for help with commands
       'quit' to quit

virsh # list
 Id Name                 State
----------------------------------

virsh # ^C
[root@south-01 tmp]# 
[root@south-01 tmp]# ps 3892
  PID TTY      STAT   TIME COMMAND

[root@south-01 tmp]# service vdsmd restart
Shutting down vdsm daemon: 
vdsm watchdog stop                                         [  OK  ]
vdsm stop                                                  [  OK  ]
Restarting netconsole...
Disabling netconsole                                       [  OK  ]
Initializing netconsole                                    [  OK  ]
Starting iscsid: 
Starting up vdsm daemon: 
vdsm start                                                 [  OK  ]
[root@south-01 tmp]# vdsClient -s 0 list table
[root@south-01 tmp]# 



SPM: 

[root@south-02 host_reboot]# vdsClient -s 0 list table
b6f5085c-4f31-4b68-a0a8-f5e2a445eb6c  25494  333333               Paused                                   
af44f765-d691-4273-986f-3412a3648c80  25266  444444               Paused                                   
[root@south-02 host_reboot]# 
[root@south-02 host_reboot]# 
[root@south-02 host_reboot]# 
[root@south-02 host_reboot]# virsh
Welcome to virsh, the virtualization interactive terminal.

Type:  'help' for help with commands
       'quit' to quit

virsh # list
 Id Name                 State
----------------------------------
 35 444444               paused
 36 333333               paused

virsh # ^C
[root@south-02 host_reboot]# ps 25494
  PID TTY      STAT   TIME COMMAND
25494 ?        Sl     2:23 /usr/libexec/qemu-kvm -S -M rhel6.0.0 -cpu Opteron_G2 -enable-nesting -enable-kvm -m 512 -smp 1,sockets=1,cores=1,threads=1 -name 333333 -uuid b6f5085
[root@south-02 host_reboot]# 


[root@south-02 host_reboot]# service vdsmd restart
Shutting down vdsm daemon: 
vdsm watchdog stop                                         [  OK  ]
vdsm stop                                                  [  OK  ]
Restarting netconsole...
Disabling netconsole                                       [  OK  ]
Initializing netconsole                                    [  OK  ]
Starting iscsid: 
Starting up vdsm daemon: 
vdsm start                                                 [  OK  ]
[root@south-02 host_reboot]# vdsClient -s 0 list table
b6f5085c-4f31-4b68-a0a8-f5e2a445eb6c  25494  333333               Paused                                   
af44f765-d691-4273-986f-3412a3648c80  25266  444444               Paused                                   
[root@south-02 host_reboot]# 


host reboot: 

Welcome to a node of the Westford 64-node cluster.

For current system assignments see:
http://intranet.corp.redhat.com/ic/intranet/ClusterNsew.html

For other details of the cluster systems see:
https://wiki.test.redhat.com/ClusterStorage/NsewCluster

The last tree installed was RHEL6.0-20100909.1-Server

[root@south-01 ~]# 
[root@south-01 ~]# vdsClient -s 0 list table
[root@south-01 ~]#

Comment 1 Dan Kenigsberg 2011-04-21 22:31:40 UTC
We fail to tear down a volume without accessing it. I think we should succeed.

It's not a real regression: the previous state, where you could destroy a VM but starting it up would deadlock vdsm, was much worse.
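
To illustrate the point (a sketch, assuming block storage, where teardown deactivates the volume's LV; the VG/LV names are placeholders): the deactivation step itself needs the storage, so while the LUN is offline it fails and the resource lock is never released.

lvchange --available n <vg>/<lv>   # fails with I/O errors while the LUN is offline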

Dafna, why is this a test blocker?

Comment 7 Igor Lvovsky 2011-05-04 14:49:47 UTC
http://gerrit.usersys.redhat.com/#change,357

Comment 9 Dafna Ron 2011-05-19 11:27:44 UTC
The fix worked great on a one-host cluster,
but a two-host cluster cannot be checked because of bug 706042.
Blocked until bug 706042 is fixed.

Comment 12 Dafna Ron 2011-06-24 10:33:38 UTC
verified on ic127
vdsm-4.9-75.el6.x86_64

Comment 13 errata-xmlrpc 2011-12-06 07:14:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2011-1782.html