Created attachment 616949 [details]
logs

Description of problem:
I tried the new "Cancel Migration" button while storage was blocked with iptables on the source host, and the cancel failed with an unexpected exception in vdsm.

Version-Release number of selected component (if applicable):
vdsm-4.9.6-34.0.el6_3.x86_64
libvirt-0.9.10-21.el6_3.4.x86_64
si18.1

How reproducible:
100%

Steps to Reproduce:
1. In a two-host cluster with one NFS storage domain, block connectivity to the SD using iptables.
2. Once VMs start to migrate, press the "Cancel Migration" button in webadmin.

Actual results:
We fail to cancel the migration with an unexpected exception.

Expected results:
We should not fail, and if we do, we should return a correct error code to the user.

Additional info:
vdsm and backend logs attached.
Attaching also the sanlock log, and adding the exception.

backend:

2012-09-25 11:51:40,714 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-48) [7939ae2] Failed in CancelMigrateVDS method
2012-09-25 11:51:40,714 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-48) [7939ae2] Error code unexpected and error message VDSGenericException: VDSErrorException: Failed to CancelMigrateVDS, error = Unexpected exception
2012-09-25 11:51:40,714 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-48) [7939ae2] Command org.ovirt.engine.core.vdsbroker.vdsbroker.CancelMigrateVDSCommand return value
 Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusOnlyReturnForXmlRpc
 mStatus    Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusForXmlRpc
 mCode      16
 mMessage   Unexpected exception
2012-09-25 11:51:40,715 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-48) [7939ae2] Vds: gold-vdsc
2012-09-25 11:51:40,715 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-4-thread-48) [7939ae2] Command CancelMigrateVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to CancelMigrateVDS, error = Unexpected exception
2012-09-25 11:51:40,715 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CancelMigrateVDSCommand] (pool-4-thread-48) [7939ae2] FINISH, CancelMigrateVDSCommand, log id: 12185d88
2012-09-25 11:51:40,715 ERROR [org.ovirt.engine.core.bll.CancelMigrateVmCommand] (pool-4-thread-48) [7939ae2] Command org.ovirt.engine.core.bll.CancelMigrateVmCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to CancelMigrateVDS, error = Unexpected exception
2012-09-25 11:51:40,717 ERROR [org.ovirt.engine.core.bll.CancelMigrateVmCommand] (pool-4-thread-48) [7939ae2] Transaction rolled-back for command: org.ovirt.engine.core.bll.CancelMigrateVmCommand.
vdsm:

clientIFinit::DEBUG::2012-09-25 11:51:38,366::task::978::TaskManager.Task::(_decref) Task=`669825b0-2da0-4b2f-8b91-f8083b8fc70c`::ref 0 aborting False
Thread-175::ERROR::2012-09-25 11:51:40,001::BindingXMLRPC::873::vds::(wrapper) libvirt error
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 869, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 246, in vmMigrationCancel
    return vm.migrateCancel()
  File "/usr/share/vdsm/API.py", line 446, in migrateCancel
    return v.migrateCancel()
  File "/usr/share/vdsm/vm.py", line 1065, in migrateCancel
    self._migrationSourceThread.stop()
  File "/usr/share/vdsm/libvirtvm.py", line 465, in stop
    self._vm._dom.abortJob()
  File "/usr/share/vdsm/libvirtvm.py", line 502, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock
Thread-153::INFO::2012-09-25 11:51:41,013::libvirtvm::388::vm.Vm::(run) vmId=`b88a8bb6-605c-4ce5-b50a-9bfdefc6aa8d`::Migration Progress: 74 seconds elapsed, 0% of data processed, 0% of mem processed
Thread-149::INFO::2012-09-25 11:51:41,013::libvirtvm::388::vm.Vm::(run) vmId=`dfa58ab9-912a-411d-9a8c-6c299ecae568`::Migration Progress: 75 seconds elapsed, 0% of data processed, 0% of mem processed
Thread-190::DEBUG::2012-09-25 11:51:42,489::task::588::TaskManager.Task::(_updateState) Task=`1424d361-c4aa-4e3f-86d1-3e5e89f91b3e`::moving from state init -> state preparing
Thread-190::INFO::2012-09-25 11:51:42,490::logUtils::37::dispatcher::(wrapper) Run and protect: repoStats(options=None)
Thread-190::INFO::2012-09-25 11:51:42,490::logUtils::39::dispatcher::(wrapper) Run and protect: repoStats, Return response: {}
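For reference, the TimeoutError above comes from vdsm's wrapper around libvirt domain calls. A minimal sketch of that pattern (simplified from the "f" wrapper in libvirtvm.py; names and details approximate, not the verbatim code): libvirt serializes state-changing calls per domain, so while migrateToURI2() is blocked on unreachable storage, abortJob() cannot take the state change lock and fails with VIR_ERR_OPERATION_TIMEOUT, which the wrapper re-raises as TimeoutError.

import libvirt


class TimeoutError(libvirt.libvirtError):
    pass


def wrap_domain_call(dom, name):
    # Sketch of vdsm's per-method wrapper ("f" in the traceback above).
    attr = getattr(dom, name)

    def f(*args, **kwargs):
        try:
            return attr(*args, **kwargs)
        except libvirt.libvirtError as e:
            if e.get_error_code() == libvirt.VIR_ERR_OPERATION_TIMEOUT:
                # "Timed out during operation: cannot acquire state
                # change lock" ends up here.
                toe = TimeoutError(e.get_error_message())
                toe.err = e.err
                raise toe
            raise
    return f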
Created attachment 616956 [details] sanlock log
It is an unexpected exception... and I don't know if we need to handle it. vdsm can return the error code 'nfsErr', which represents the error 'Image repository access timeout'.

Dan, what do you think?
I'm not sure how urgent it is, but creating a new "failed to cancel migration" error, or something more specific such as a "VM not responding" error, makes more sense to me.
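To make the suggestion concrete, a hedged sketch of what the mapping could look like in vdsm's XML-RPC layer. The 'migCancelTimeout' entry and its code number are hypothetical; errCode only loosely mirrors vdsm's define.py, and 'unexpected' matches the code 16 seen in the engine log.

# TimeoutError here means the wrapper's TimeoutError from the sketch in
# comment 2 above.
errCode = {
    'unexpected': {'status': {'code': 16,
                              'message': 'Unexpected exception'}},
    'migCancelTimeout': {'status': {'code': 99,  # illustrative only
                                    'message': 'Cannot cancel migration: '
                                               'VM did not respond in time'}},
}


def vmMigrationCancel(vm):
    # Instead of letting TimeoutError bubble up and get reported as
    # 'Unexpected exception', map it to a dedicated error code.
    try:
        return vm.migrateCancel()
    except TimeoutError:
        return errCode['migCancelTimeout']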
Dafna, what do you see in the audit log?

Can you post the iptables rule that you use?

Did you apply the iptables block before the migration or during it?
(In reply to comment #5)
> Dafna, what do you see in the audit log?
>
> Can you post the iptables rule that you use?
>
> Did you apply the iptables block before the migration or during it?

The audit log only shows that cancel migration failed - no info at all, so the user has to look in the logs.

iptables -A OUTPUT -d <server> -j DROP

The migration starts because the storage is blocked by iptables (rhevm triggers the migration, not the user).
Dafna,
I cannot reproduce this scenario. When you block both hosts from the storage, you are not supposed to get migration or SPM selection (if you only have these 2 hosts active in your cluster). Is that the case for you, or did you have other hosts?
(In reply to comment #9)
> Dafna,
> I cannot reproduce this scenario. When you block both hosts from the storage,
> you are not supposed to get migration or SPM selection (if you only have
> these 2 hosts active in your cluster). Is that the case for you, or did you
> have other hosts?

SPM selection and migration always happen when the SPM is not able to connect to the storage. SPM selection will continue through all the hosts in the cluster until the last one, which will release SPM (all the other hosts will become non-operational). I even had a bug where we stop trying to migrate when we have 3 hosts and migration fails on the second one.

Please make sure you are working with the latest vdsm/qemu/libvirt packages on rhel6 hosts.
(In reply to comment #12)

I have the latest version. After talking to Roy and Omer, we agreed that this is the right behavior, i.e. if you are getting SPM selection or migration when all cluster hosts cannot access the storage, that is a different bug. Please reproduce it and show me on your environment.
I reproduced this again in si20.

Steps: 2 hosts with one NFS storage domain; ran VMs on the SPM -> blocked connectivity to the storage domain from the host -> VMs started migrating -> I cancelled migration on one of the VMs.

The first time I tried to cancel the migration, vdsm reported that the operation succeeded, but the backend reported "Fatal error during migration" and tried to migrate the VM again with InternalMigrateVmCommand (the log even reported that it is trying to migrate to a different host, although I only have two).

vdsm reports success:

Thread-186::ERROR::2012-10-17 10:35:47,447::vm::243::vm.Vm::(run) vmId=`4406c46b-7452-43f7-aa51-8a751c2e66c9`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 226, in run
    self._startUnderlyingMigration()
  File "/usr/share/vdsm/libvirtvm.py", line 454, in _startUnderlyingMigration
    None, maxBandwidth)
  File "/usr/share/vdsm/libvirtvm.py", line 494, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 82, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1104, in migrateToURI2
    if ret == -1: raise libvirtError ('virDomainMigrateToURI2() failed', dom=self)
libvirtError: operation aborted: migration job: canceled by client

rhevm log reports a fatal error:

2012-10-17 10:45:13,141 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-46) [f71d47f] Error code migrateErr and error message VDSGenericException: VDSErrorException: Failed to MigrateStatusVDS, error = Fatal error during migration
2012-10-17 10:45:13,141 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-46) [f71d47f] Command org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateStatusVDSCommand return value
 Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusOnlyReturnForXmlRpc
 mStatus    Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusForXmlRpc
 mCode      12
 mMessage   Fatal error during migration

The second time I tried to cancel the migration, I got the unexpected error again.

vdsm reports a timeout:

Thread-3116::ERROR::2012-10-17 10:37:20,019::BindingXMLRPC::883::vds::(wrapper) libvirt error
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 879, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 256, in vmMigrationCancel
    return vm.migrateCancel()
  File "/usr/share/vdsm/API.py", line 446, in migrateCancel
    return v.migrateCancel()
  File "/usr/share/vdsm/vm.py", line 1069, in migrateCancel
    self._migrationSourceThread.stop()
  File "/usr/share/vdsm/libvirtvm.py", line 465, in stop
    self._vm._dom.abortJob()
  File "/usr/share/vdsm/libvirtvm.py", line 502, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock

rhevm reports an unexpected exception:

2012-10-17 10:46:44,409 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-48) [56adcb44] Error code unexpected and error message VDSGenericException: VDSErrorException: Failed to CancelMigrateVDS, error = Unexpected exception
2012-10-17 10:46:44,409 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-48) [56adcb44] Command org.ovirt.engine.core.vdsbroker.vdsbroker.CancelMigrateVDSCommand return value
 Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusOnlyReturnForXmlRpc
 mStatus    Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusForXmlRpc
 mCode      16
 mMessage   Unexpected exception
2012-10-17 10:46:44,409 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-48) [56adcb44] HostName = gold-vdsd
2012-10-17 10:46:44,409 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-4-thread-48) [56adcb44] Command CancelMigrateVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to CancelMigrateVDS, error = Unexpected exception

Attaching full logs again.
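For the first symptom above (the engine treating a user-requested cancel as "Fatal error during migration" and retrying), one possible direction, as a hedged sketch: libvirt flags a client-initiated abort with a dedicated error code, so the migration thread could classify it and report a cancellation rather than a generic failure. The function name is hypothetical.

import libvirt


def classify_migration_failure(exc):
    # A client-requested abortJob() makes migrateToURI2() fail with
    # "operation aborted: migration job: canceled by client", which
    # carries VIR_ERR_OPERATION_ABORTED; treat that as a cancellation,
    # not a fatal migration error that warrants an automatic retry.
    if (isinstance(exc, libvirt.libvirtError) and
            exc.get_error_code() == libvirt.VIR_ERR_OPERATION_ABORTED):
        return 'canceled'
    return 'failed'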
Created attachment 628616 [details] logs from si20
Thanks Dafna. It looks like a mixed error. When we start the underlying migration, vdsm calls virDomainMigrateToURI2() for libvirt to start the migration; since we block the data domain, that call is in blocking mode, and the error we raise is fatal (it is a timeout for the domain).

The problem is that in the meantime we call cancel migration and return the error of the timeout of the first call (the migrate call). I think we don't need to raise an error if the migration has already raised one.
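A hedged sketch of that idea (class and field names are hypothetical, not the actual vdsm code): stop() checks whether the migration thread already failed before calling abortJob(), so the cancel path doesn't raise a second, misleading error.

class MigrationSourceThread(object):
    def __init__(self, vm):
        self._vm = vm
        self._failed = False  # set by run() when migrateToURI2() raises

    def stop(self):
        if self._failed:
            # The migration already aborted on its own (e.g. the storage
            # timeout); nothing left to cancel, and no reason to raise.
            return {'status': {'code': 0,
                               'message': 'Migration already failed'}}
        # Otherwise really ask libvirt to abort; this may still block on
        # the state change lock while the migrate call is stuck.
        self._vm._dom.abortJob()
        return {'status': {'code': 0, 'message': 'Done'}}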
Bug 867347 may actually change the behavior. We may need to retest in the next build...
According to Comment 17, please retest now that bug 867347 is fixed.
Is anything still going on regarding this bug?
Eventually I didn't start working on solving this; moving back to NEW.
Bug 697277 may help here a lot.
Tested with the source host as SPM.

When blocking the destination host from storage, we get the following error in the audit log and engine log:
**********************************
Migration failed (VM: win10, Source: luna, Destination: shaharh)
**********************************
which looks like good behavior.

When blocking the source host (i.e. the SPM), we get a lot of SPM connectivity errors in the engine log, and the following error in the VM's audit log:
**********************************
VM win10 was set to the Unknown status.
**********************************
and you don't have a functioning cluster, since you don't have an SPM.

In both cases, when unblocking the storage the VM returns to a running state.
Created attachment 1068644 [details] NonSPM Blocking the destination (Not the SPM)
Created attachment 1068645 [details] CancelSPM Blocking the source (the SPM)
Attached two screenshots.

If the migration is in virDomainMigrateToURI3 and we block the storage, the vdsm thread is in blocking mode until the storage connection timeout. If we try to cancel the migration (i.e. call self._vm._dom.abortJob()), the call will not take place until the timeout has expired. The user will see the errors shown in the CancelSPM and NonSPM screenshots.

The vdsm logs show:

source log:
***************************
Thread-296::ERROR::2015-08-31 15:50:48,074::migration::310::virt.vm::(run) vmId=`6f120e6a-5944-473f-a7ae-350936b59c55`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 294, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/virt/migration.py", line 364, in _startUnderlyingMigration
    self._perform_migration(duri, muri)
  File "/usr/share/vdsm/virt/migration.py", line 403, in _perform_migration
    self._vm._dom.migrateToURI3(duri, params, flags)
  File "/usr/share/vdsm/virt/vm.py", line 231, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 124, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1836, in migrateToURI3
***************************

dest log:
***************************
Thread-51282::ERROR::2015-08-31 15:47:06,470::vm::780::virt.vm::(_startUnderlyingVm) vmId=`6f120e6a-5944-473f-a7ae-350936b59c55`::Failed to start a migration destination vm
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 750, in _startUnderlyingVm
    self._completeIncomingMigration()
  File "/usr/share/vdsm/virt/vm.py", line 2732, in _completeIncomingMigration
    self._incomingMigrationFinished.isSet(), usedTimeout)
  File "/usr/share/vdsm/virt/vm.py", line 2785, in _attachLibvirtDomainAfterMigration
    raise MigrationError(e.get_error_message())
***************************
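Given that behavior, a hedged sketch of one way to keep the API responsive: issue abortJob() from a helper thread and report the cancel as queued if it does not return within a bounded wait, instead of surfacing an unexpected exception. The function name and timeout are hypothetical.

import threading


def cancel_migration_async(dom, wait=10):
    # abortJob() blocks on the domain's state change lock while
    # migrateToURI3() is stuck on unreachable storage, so run it in a
    # helper thread; it takes effect once the blocked call times out.
    result = {}

    def _abort():
        try:
            dom.abortJob()
            result['status'] = 'canceled'
        except Exception as e:
            result['status'] = 'failed: %s' % e

    t = threading.Thread(target=_abort)
    t.daemon = True
    t.start()
    t.join(wait)
    if t.is_alive():
        # Still waiting for the lock: the cancel is queued, not failed.
        return 'cancel queued'
    return result['status']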
It behaves well. But the message is a bit confusing ("cancel failed") when the cancellation is actually queued and correctly performed once storage is restored.