Bug 860222 - fail to cancel migration with Unexpected exception when storage is blocked in source
Summary: fail to cancel migration with Unexpected exception when storage is blocked in source
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.1.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 3.6.0
Assignee: Shahar Havivi
QA Contact: Israel Pinto
URL:
Whiteboard: virt
Depends On: 697277
Blocks:
 
Reported: 2012-09-25 09:57 UTC by Dafna Ron
Modified: 2015-09-22 13:10 UTC
CC List: 13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-09-02 10:24:02 UTC
oVirt Team: ---
Target Upstream Version:
Embargoed:


Attachments
logs (1.95 MB, application/x-gzip)
2012-09-25 09:57 UTC, Dafna Ron
sanlock log (23.36 KB, application/x-xz)
2012-09-25 10:24 UTC, Dafna Ron
logs from si20 (563.00 KB, application/x-gzip)
2012-10-17 09:10 UTC, Dafna Ron
NonSPM (215.38 KB, image/png)
2015-08-31 13:23 UTC, Shahar Havivi
CancelSPM (209.68 KB, image/png)
2015-08-31 13:24 UTC, Shahar Havivi

Description Dafna Ron 2012-09-25 09:57:39 UTC
Created attachment 616949 [details]
logs

Description of problem:

I tried the new "Cancel Migration" button while storage was blocked with iptables on the source host, and the cancel failed with an unexpected exception in vdsm

Version-Release number of selected component (if applicable):

vdsm-4.9.6-34.0.el6_3.x86_64
libvirt-0.9.10-21.el6_3.4.x86_64
si18.1

How reproducible:

100%

Steps to Reproduce:
1. In a two-host cluster with one NFS storage domain, block connectivity to the SD using iptables
2. Once VMs start to migrate, press the "Cancel Migration" button in webadmin
  
Actual results:

Cancelling the migration fails with an unexpected exception.

Expected results:

Cancelling should not fail, and if it does fail, the user should get a correct, specific error code.
 
Additional info: vdsm and backend logs are attached

Comment 1 Dafna Ron 2012-09-25 10:23:32 UTC
Attaching the sanlock log as well, and adding the exception:


backend: 

2012-09-25 11:51:40,714 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-48) [7939ae2] Failed in CancelMigrateVDS method
2012-09-25 11:51:40,714 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-48) [7939ae2] Error code unexpected and error message VDSGenericException: VDSErrorException: Failed to CancelMigrateVDS, error = Unexpected exception
2012-09-25 11:51:40,714 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-48) [7939ae2] Command org.ovirt.engine.core.vdsbroker.vdsbroker.CancelMigrateVDSCommand return value 
 Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusOnlyReturnForXmlRpc
mStatus                       Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusForXmlRpc
mCode                         16
mMessage                      Unexpected exception


2012-09-25 11:51:40,715 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-48) [7939ae2] Vds: gold-vdsc
2012-09-25 11:51:40,715 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-4-thread-48) [7939ae2] Command CancelMigrateVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to CancelMigrateVDS, error = Unexpected exception
2012-09-25 11:51:40,715 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CancelMigrateVDSCommand] (pool-4-thread-48) [7939ae2] FINISH, CancelMigrateVDSCommand, log id: 12185d88
2012-09-25 11:51:40,715 ERROR [org.ovirt.engine.core.bll.CancelMigrateVmCommand] (pool-4-thread-48) [7939ae2] Command org.ovirt.engine.core.bll.CancelMigrateVmCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to CancelMigrateVDS, error = Unexpected exception
2012-09-25 11:51:40,717 ERROR [org.ovirt.engine.core.bll.CancelMigrateVmCommand] (pool-4-thread-48) [7939ae2] Transaction rolled-back for command: org.ovirt.engine.core.bll.CancelMigrateVmCommand.




vdsm: 

clientIFinit::DEBUG::2012-09-25 11:51:38,366::task::978::TaskManager.Task::(_decref) Task=`669825b0-2da0-4b2f-8b91-f8083b8fc70c`::ref 0 aborting False
Thread-175::ERROR::2012-09-25 11:51:40,001::BindingXMLRPC::873::vds::(wrapper) libvirt error
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 869, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 246, in vmMigrationCancel
    return vm.migrateCancel()
  File "/usr/share/vdsm/API.py", line 446, in migrateCancel
    return v.migrateCancel()
  File "/usr/share/vdsm/vm.py", line 1065, in migrateCancel
    self._migrationSourceThread.stop()
  File "/usr/share/vdsm/libvirtvm.py", line 465, in stop
    self._vm._dom.abortJob()
  File "/usr/share/vdsm/libvirtvm.py", line 502, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock
Thread-153::INFO::2012-09-25 11:51:41,013::libvirtvm::388::vm.Vm::(run) vmId=`b88a8bb6-605c-4ce5-b50a-9bfdefc6aa8d`::Migration Progress: 74 seconds elapsed, 0% of data processed, 0% of mem processed
Thread-149::INFO::2012-09-25 11:51:41,013::libvirtvm::388::vm.Vm::(run) vmId=`dfa58ab9-912a-411d-9a8c-6c299ecae568`::Migration Progress: 75 seconds elapsed, 0% of data processed, 0% of mem processed
Thread-190::DEBUG::2012-09-25 11:51:42,489::task::588::TaskManager.Task::(_updateState) Task=`1424d361-c4aa-4e3f-86d1-3e5e89f91b3e`::moving from state init -> state preparing
Thread-190::INFO::2012-09-25 11:51:42,490::logUtils::37::dispatcher::(wrapper) Run and protect: repoStats(options=None)
Thread-190::INFO::2012-09-25 11:51:42,490::logUtils::39::dispatcher::(wrapper) Run and protect: repoStats, Return response: {}

Comment 2 Dafna Ron 2012-09-25 10:24:53 UTC
Created attachment 616956 [details]
sanlock log

Comment 3 Shahar Havivi 2012-09-27 14:42:54 UTC
It is an unexpected exception... and I don't know if we need to handle it.

vdsm could return the error code 'nfsErr', which represents the error 'Image repository access timeout'.
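
For illustration, a minimal sketch of that mapping; the errCode entries mimic vdsm's define.py, but the 'nfsErr' code number and the helper names here are assumptions, not this vdsm version's actual code:

# Illustrative sketch only: map the cancel-path timeout to 'nfsErr'
# instead of letting it surface as the generic "Unexpected exception".
errCode = {
    'unexpected': {'status': {'code': 16,
                              'message': 'Unexpected exception'}},
    'nfsErr': {'status': {'code': 21,  # assumed code number
                          'message': 'Image repository access timeout'}},
}

class TimeoutError(Exception):
    """Stands in for the wrapper error 'cannot acquire state change lock'."""

def migrate_cancel(vm):
    try:
        vm.migrateCancel()
        return {'status': {'code': 0, 'message': 'Done'}}
    except TimeoutError:
        # Storage is blocked; report a meaningful code to the engine.
        return errCode['nfsErr']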

Dan, what do you think?

Comment 4 Dan Kenigsberg 2012-10-02 09:13:43 UTC
I'm not sure how urgent it is, but something like a new "failed to cancel migration" error, or something more specific such as a "VM not responding" error, makes more sense to me.

Comment 5 Shahar Havivi 2012-10-03 11:34:59 UTC
Dafna,
What do you see in the audit log?

Can you post the iptables rule that you used?

Did you block with iptables before the migration or during it?

Comment 6 Dafna Ron 2012-10-03 11:49:03 UTC
(In reply to comment #5)
> Dafna,
> What do you see in the audit log?
> 
> Can you post the iptables rule that you used?
> 
> Did you block with iptables before the migration or during it?

The audit log shows that cancel migration failed - no details at all, so the user has to look in the logs.

iptables -A OUTPUT -d <server>  -j DROP

The migration starts because the storage is blocked by iptables (RHEV-M triggers the migration, not the user).

Comment 9 Shahar Havivi 2012-10-15 15:07:17 UTC
Dafna,
I cannot reproduce this scenario.
When you block both hosts from the storage, you are not supposed to get migration or SPM selection (if you only have these 2 hosts active in your cluster).
Is that the case for you, or did you have other hosts?

Comment 12 Dafna Ron 2012-10-15 15:41:39 UTC
(In reply to comment #9)
> Dafna,
> I cannot reproduce this scenario.
> When you block both hosts from the storage, you are not supposed to get
> migration or SPM selection (if you only have these 2 hosts active in your
> cluster).
> Is that the case for you, or did you have other hosts?

SPM selection and migration always happen when the SPM is not able to connect to the storage. SPM selection will continue across all hosts in the cluster until the last one, which will release SPM (all other hosts will become non-operational).
I even filed a bug where we stop trying to migrate when we have 3 hosts and migration fails on the second one.
 
Please make sure you are working with the latest vdsm/qemu/libvirt packages on RHEL 6 hosts.

Comment 13 Shahar Havivi 2012-10-15 15:52:12 UTC
(In reply to comment #12)
I have the latest version. After talking to Roy and Omer, we agreed that this is the right behavior; i.e., if you are getting SPM selection or migration when all cluster hosts cannot access the storage, that is a different bug.

Please reproduce it and show me in your environment.

Comment 14 Dafna Ron 2012-10-17 09:09:57 UTC
I reproduced this again in si20.
Steps:
2 hosts with one NFS storage domain
ran VMs on the SPM -> blocked connectivity to the storage domain from the host
VMs started migrating
I cancelled migration on one of the VMs

The first time I tried to cancel the migration, vdsm reported that the operation succeeded, but the backend reported a fatal error during migration and tried to migrate the VM again with InternalMigrateVmCommand (the event log even reported that it is trying to migrate to a different host, although I only have two).

vdsm reports that the cancel succeeded:

Thread-186::ERROR::2012-10-17 10:35:47,447::vm::243::vm.Vm::(run) vmId=`4406c46b-7452-43f7-aa51-8a751c2e66c9`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 226, in run
    self._startUnderlyingMigration()
  File "/usr/share/vdsm/libvirtvm.py", line 454, in _startUnderlyingMigration
    None, maxBandwidth)
  File "/usr/share/vdsm/libvirtvm.py", line 494, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 82, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1104, in migrateToURI2
    if ret == -1: raise libvirtError ('virDomainMigrateToURI2() failed', dom=self)
libvirtError: operation aborted: migration job: canceled by client


The RHEV-M log reports a fatal error:

2012-10-17 10:45:13,141 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-46) [f71d47f] Error code migrateErr and error message VDSGenericException: VDSErrorException: Failed to MigrateStatusVDS, error = Fatal error during migration
2012-10-17 10:45:13,141 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-46) [f71d47f] Command org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateStatusVDSCommand return value 
 Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusOnlyReturnForXmlRpc
mStatus                       Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusForXmlRpc
mCode                         12
mMessage                      Fatal error during migration



The second time I tried to cancel the migration, I got the unexpected error again.

vdsm reports a timeout:

Thread-3116::ERROR::2012-10-17 10:37:20,019::BindingXMLRPC::883::vds::(wrapper) libvirt error
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 879, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 256, in vmMigrationCancel
    return vm.migrateCancel()
  File "/usr/share/vdsm/API.py", line 446, in migrateCancel
    return v.migrateCancel()
  File "/usr/share/vdsm/vm.py", line 1069, in migrateCancel
    self._migrationSourceThread.stop()
  File "/usr/share/vdsm/libvirtvm.py", line 465, in stop
    self._vm._dom.abortJob()
  File "/usr/share/vdsm/libvirtvm.py", line 502, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock


RHEV-M reports an unexpected exception:

2012-10-17 10:46:44,409 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-48) [56adcb44] Error code unexpected and error message VDSGenericException: VDSErrorException: Failed to CancelMigrateVDS, error = Unexpected exception
2012-10-17 10:46:44,409 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-48) [56adcb44] Command org.ovirt.engine.core.vdsbroker.vdsbroker.CancelMigrateVDSCommand return value 
 Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusOnlyReturnForXmlRpc
mStatus                       Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusForXmlRpc
mCode                         16
mMessage                      Unexpected exception


2012-10-17 10:46:44,409 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-48) [56adcb44] HostName = gold-vdsd
2012-10-17 10:46:44,409 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-4-thread-48) [56adcb44] Command CancelMigrateVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to CancelMigrateVDS, error = Unexpected exception


Attaching full logs again.

Comment 15 Dafna Ron 2012-10-17 09:10:32 UTC
Created attachment 628616 [details]
logs from si20

Comment 16 Shahar Havivi 2012-10-17 13:02:39 UTC
Thanks Dafna,
It looks like a mixed error: when we start the underlying migration, vdsm calls virDomainMigrateToURI2() to have libvirt start the migration. Since the data domain is blocked, that call blocks, and the error we eventually raise is fatal (it is the storage timeout for the domain).

The problem is that in the meantime we call cancel migration, and we return the timeout error of the first call (the migrate call).

I think we don't need to raise an error if the migration has already raised one.
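
To make that concrete, here is a minimal sketch of the timeout-wrapper pattern visible at libvirtvm.py:502 in the tracebacks above; the function name, timeout value, and retry loop are assumptions modelling the described behavior, not the actual vdsm code:

import time

class TimeoutError(Exception):
    pass

def call_with_timeout(libvirt_call, timeout=30, retry_delay=1):
    # While the blocked migrateToURI2() call holds libvirt's state
    # change lock, abortJob() keeps failing; retry until the deadline,
    # then surface a TimeoutError to the caller (the cancel path).
    deadline = time.time() + timeout
    while True:
        try:
            return libvirt_call()
        except Exception as e:
            if 'cannot acquire state change lock' not in str(e):
                raise
            if time.time() >= deadline:
                raise TimeoutError('Timed out during operation: '
                                   'cannot acquire state change lock')
            time.sleep(retry_delay)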

Comment 17 Michal Skrivanek 2012-10-26 06:45:23 UTC
Bug 867347 may actually change the behavior. We may need to retest in the next build...

Comment 18 Omer Frenkel 2012-11-26 12:02:33 UTC
According to comment 17, please retest now that bug 867347 is fixed.

Comment 24 Michal Skrivanek 2013-09-13 09:04:30 UTC
Is anything still going on regarding this bug?

Comment 25 Omer Frenkel 2013-09-15 08:06:59 UTC
Eventually I didn't start working on solving this; moving back to NEW.

Comment 27 Michal Skrivanek 2014-01-31 10:46:12 UTC
Bug 697277 may help here a lot.

Comment 29 Shahar Havivi 2015-08-30 09:11:24 UTC
Tested with the source host as SPM.

When blocking the destination host from the storage, we get the following error in the audit log and the engine log:
**********************************
Migration failed  (VM: win10, Source: luna, Destination: shaharh)
**********************************
which looks like correct behavior.

When blocking the source host (i.e. the SPM), we get a lot of SPM connectivity errors in the engine log, and we get the following error in the VM's audit log:
**********************************
VM win10 was set to the Unknown status.
**********************************
and you don't have a functioning cluster, since you don't have an SPM.

In both cases, when the storage is unblocked, the VM returns to a running state.

Comment 30 Shahar Havivi 2015-08-31 13:23:42 UTC
Created attachment 1068644 [details]
NonSPM

Blocking the destination (Not the SPM)

Comment 31 Shahar Havivi 2015-08-31 13:24:26 UTC
Created attachment 1068645 [details]
CancelSPM

Blocking the source (the SPM)

Comment 32 Shahar Havivi 2015-08-31 13:37:47 UTC
Attached two screenshots.
If the migration is inside virDomainMigrateToURI3() and we block the storage, the vdsm thread blocks until the storage connection timeout; if we try to cancel the migration in the meantime (i.e. call self._vm._dom.abortJob()), the abort call will not take place until the timeout has expired.
The user will see the errors as shown in the CancelSPM and NonSPM screenshots.
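
A toy model of that queuing behavior (purely illustrative, not vdsm code): the migration thread holds a state-change lock while the migrate call is blocked on storage, so a concurrent abortJob() can only run once the storage timeout expires.

import threading
import time

state_change_lock = threading.Lock()

def migration_thread(storage_timeout):
    with state_change_lock:
        # Stands in for migrateToURI3() blocked on inaccessible storage.
        time.sleep(storage_timeout)
        print('migrate call returned: storage timeout expired')

def cancel_migration():
    print('cancel requested; waiting for the state change lock...')
    with state_change_lock:
        # Stands in for self._vm._dom.abortJob().
        print('abortJob() finally runs')

t = threading.Thread(target=migration_thread, args=(5,))
t.start()
time.sleep(1)
cancel_migration()  # queued until the migration thread releases the lock
t.join()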

The vdsm logs show:
source log:
***************************
Thread-296::ERROR::2015-08-31 15:50:48,074::migration::310::virt.vm::(run) vmId=`6f120e6a-5944-473f-a7ae-350936b59c55`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 294, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/virt/migration.py", line 364, in _startUnderlyingMigration
    self._perform_migration(duri, muri)
  File "/usr/share/vdsm/virt/migration.py", line 403, in _perform_migration
    self._vm._dom.migrateToURI3(duri, params, flags)
  File "/usr/share/vdsm/virt/vm.py", line 231, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 124, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1836, in migrateToURI3
***************************

dest log:
***************************
Thread-51282::ERROR::2015-08-31 15:47:06,470::vm::780::virt.vm::(_startUnderlyingVm) vmId=`6f120e6a-5944-473f-a7ae-350936b59c55`::Failed to start a migration destination vm
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 750, in _startUnderlyingVm
    self._completeIncomingMigration()
  File "/usr/share/vdsm/virt/vm.py", line 2732, in _completeIncomingMigration
    self._incomingMigrationFinished.isSet(), usedTimeout)
  File "/usr/share/vdsm/virt/vm.py", line 2785, in _attachLibvirtDomainAfterMigration
    raise MigrationError(e.get_error_message())
***************************

Comment 33 Michal Skrivanek 2015-09-01 15:40:50 UTC
It behaves well, but the message is a bit confusing ("cancel failed") when the cancel is actually queued and then correctly applied once storage is restored.

