Created attachment 533387 [details]
logs

Description of problem:
Blocking connectivity to the export domain during a VM export: the problem is detected by vdsm and reported to the backend, but it is not reported to the user in the Admin GUI.

2011-11-13 18:52:18,229 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-62) domain 9fec96e9-4dec-465d-a4a4-fdc44b867154 in problem. vds: blond-vdsg

Since the dd tasks currently continue to run in vdsm and do not fail, and the backend does not report anything to the user, the user will not know that there is any problem until the tasks fail after 12 hours.

Version-Release number of selected component (if applicable):
ic148

How reproducible:
100%

Steps to Reproduce:
1. Start an export of a VM with force override (so vdsm will use dd and not qemu-convert).
2. Once the VM starts copying, block connectivity to the export domain (a sketch of one way to do this appears at the end of this comment).

Actual results:
The backend log shows that the domain is in problem, but no alert is reported to the Admin GUI (for a data domain this is usually an audit log error).

Expected results:
We should report the problem with the domain to the user.

Additional info:
logs

Please note that the log not only reports the domain as problematic but also fails deactivate with CanDoAction:

2011-11-13 19:02:22,529 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-24) domain 9fec96e9-4dec-465d-a4a4-fdc44b867154 was reported by all hosts in status UP as problematic. Moving the Domain to NonOperational.
2011-11-13 19:02:22,534 INFO [org.ovirt.engine.core.bll.storage.HandleFailedStorageDomainCommand] (QuartzScheduler_Worker-24) Running command: HandleFailedStorageDomainCommand internal: true. Entities affected : ID: 9fec96e9-4dec-465d-a4a4-fdc44b867154 Type: Storage
2011-11-13 19:02:22,553 WARN [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainCommand] (QuartzScheduler_Worker-24) CanDoAction of action DeactivateStorageDomain failed. Reasons:VAR__TYPE__STORAGE__DOMAIN,VAR__ACTION__DEACTIVATE,ERROR_CANNOT_DEACTIVATE_DOMAIN_WITH_TASKS
2011-11-13 19:02:22,555 ERROR [org.ovirt.engine.core.bll.storage.HandleFailedStorageDomainCommand] (QuartzScheduler_Worker-24) Transaction rolled-back for command: org.ovirt.engine.core.bll.storage.HandleFailedStorageDomainCommand.
2011-11-13 19:02:22,555 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-24) No string for UNASSIGNED type. Use default Log
2011-11-13 19:02:26,780 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-38) domain 9fec96e9-4dec-465d-a4a4-fdc44b867154 in problem. vds: blond-vdsg
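For step 2, a minimal hypothetical sketch of how connectivity to the export domain could be blocked from the host during the copy. It assumes an NFS export domain whose server address is known (EXPORT_DOMAIN_IP is a placeholder) and that iptables is available on the host; it is not part of the original reproduction notes.

#!/usr/bin/env python
# Hypothetical helper to block/unblock traffic to the export domain while
# a VM export (dd) is in progress. EXPORT_DOMAIN_IP is a placeholder.
import subprocess

EXPORT_DOMAIN_IP = "10.0.0.42"  # assumption: address of the NFS export domain

def block_export_domain():
    # Drop all outgoing packets to the export domain server.
    subprocess.check_call(
        ["iptables", "-A", "OUTPUT", "-d", EXPORT_DOMAIN_IP, "-j", "DROP"])

def unblock_export_domain():
    # Remove the rule added above to restore connectivity.
    subprocess.check_call(
        ["iptables", "-D", "OUTPUT", "-d", EXPORT_DOMAIN_IP, "-j", "DROP"])

if __name__ == "__main__":
    block_export_domain()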
And also the VM is stuck in 'Image Locked' state?
OK, after a timeout it goes back to normal, but if the operation fails it would be nice to kick it back to the original state immediately.
When the image gets unlocked, the error message appears in the Events tab: Failed to export VM ... to .... This happens quite a long while after I have killed NFS on the storage server, some 10+ minutes.
(In reply to comment #1)
> And also the VM is stuck in 'Image Locked' state?

The VM will remain in Image Locked because the tasks are still running on the vds side. Once the tasks fail on the vds side you should get an export failed error.
Dafna: so what you want is to get this error message (and unlock the VM) as soon as vdsm fails and not wait the 10+ minutes until the task fails?
iscsi storage only!
(In reply to comment #6)
> iscsi storage only!

Unrelated in any way to iSCSI :) It simply depends on the task timeout.
Moving to upstream as an easy fix; we need to add an audit log entry in addition to the WARN issued to the engine log.
Isn't this a problem in vdsm? It looks like the task gets stuck in vdsm for a very long time even after the storage has crashed:

TASK: 62935675-7ceb-4035-b07c-c2c09d1460da
STATUS: running
RESULT:
MESSAGE: 'Task is initializing'

{'status': {'message': 'OK', 'code': 0}, 'taskStatus': {'message': 'Task is initializing', 'code': 0, 'taskID': '62935675-7ceb-4035-b07c-c2c09d1460da', 'taskResult': '', 'taskState': 'running'}}

So the engine cannot tell from this whether the task is really making progress or not.
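To make that concrete, here is a minimal hypothetical sketch (not vdsm or engine code) of a watchdog that looks at a task-status reply shaped like the one above and flags a task that has sat in the same state and message for too long. The get_task_status callable and the thresholds are assumptions; the point is that the only usable signal in this reply is the lack of change over time.

import time

def watch_task(get_task_status, task_id, stale_after=600, poll_interval=10):
    """Return True if the task finishes on its own, False if it looks stuck.

    get_task_status(task_id) is assumed to return a dict shaped like the reply
    quoted above, e.g. {'taskStatus': {'taskState': 'running',
    'message': 'Task is initializing', ...}}.
    """
    last = None
    last_change = time.time()
    while True:
        status = get_task_status(task_id)['taskStatus']
        current = (status['taskState'], status['message'])
        if current != last:
            last, last_change = current, time.time()
        elif time.time() - last_change > stale_after:
            return False  # no change for too long; caller could alert the user
        if status['taskState'] not in ('running', 'init'):
            return True   # task reached a terminal state on its own
        time.sleep(poll_interval)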
This is the exception that VDSM logs when the task finally times out:

Thread-133::WARNING::2012-08-02 19:02:21,991::remoteFileHandler::183::Storage.CrabRPCProxy::(callCrabRPCFunction) Problem with handler, treating as timeout
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 175, in callCrabRPCFunction
    rawLength = self._recvAll(LENGTH_STRUCT_LENGTH, timeout)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 141, in _recvAll
    raise Timeout()
Timeout
Thread-133::ERROR::2012-08-02 19:02:21,992::task::833::TaskManager.Task::(_setError) Task=`31700a1e-6a5e-4063-bec5-3ddbaeb726b7`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 840, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1091, in updateVM
    self.validateSdUUID(sdUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 203, in validateSdUUID
    sdDom.validate()
  File "/usr/share/vdsm/storage/fileSD.py", line 262, in validate
    self.getMetadata()
  File "/usr/share/vdsm/storage/sd.py", line 688, in getMetadata
    return self._metadata.copy()
  File "/usr/share/vdsm/storage/persistentDict.py", line 129, in copy
    md = self._dict.copy()
  File "/usr/share/vdsm/storage/persistentDict.py", line 318, in copy
    with self._accessWrapper():
  File "/usr/lib64/python2.6/contextlib.py", line 16, in __enter__
    return self.gen.next()
  File "/usr/share/vdsm/storage/persistentDict.py", line 152, in _accessWrapper
    self.refresh()
  File "/usr/share/vdsm/storage/persistentDict.py", line 230, in refresh
    lines = self._metaRW.readlines()
  File "/usr/share/vdsm/storage/fileSD.py", line 82, in readlines
    return misc.stripNewLines(self._oop.directReadLines(self._metafile))
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 280, in callCrabRPCFunction
    raise Timeout("Operation stuck on remote handler")
Timeout: Operation stuck on remote handler
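For readers unfamiliar with the mechanism in the traceback, here is a simplified, hypothetical sketch of the pattern it reflects: metadata reads are delegated to a helper process over a pipe, and the parent gives up with a Timeout when no reply arrives within the deadline. This is an illustration of the idea only, not the actual remoteFileHandler code; the length-prefix framing is an assumption.

import os
import select
import struct

class Timeout(Exception):
    """Raised when the remote file handler does not answer in time."""

# Assumption for this sketch: replies are framed as an 8-byte length prefix
# followed by the payload.
LENGTH_STRUCT = struct.Struct("Q")

def _recv_exact(fd, length, timeout):
    # Read exactly `length` bytes from `fd`, or raise Timeout.
    data = b""
    while len(data) < length:
        ready, _, _ = select.select([fd], [], [], timeout)
        if not ready:
            raise Timeout("Operation stuck on remote handler")
        chunk = os.read(fd, length - len(data))
        if not chunk:
            raise Timeout("Remote handler closed the pipe")
        data += chunk
    return data

def recv_reply(fd, timeout):
    # When the NFS mount is blocked, the helper process hangs in its read
    # syscall and never sends a reply; the parent gives up here instead of
    # blocking forever itself.
    (length,) = LENGTH_STRUCT.unpack(_recv_exact(fd, LENGTH_STRUCT.size, timeout))
    return _recv_exact(fd, length, timeout)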
http://gerrit.ovirt.org/6869 - a related fix

Tested again. It takes a lot of time for dd to realize that the NFS mount is gone, but then the task will fail relatively soon (a couple of minutes).
(In reply to comment #12)
> http://gerrit.ovirt.org/6869 - a related fix
>
> tested again. It takes a lot of time for dd to realize that the NFS mount is
> gone, but then the task will fail relatively soon (couple of minutes)

But does the user have any indication that this has failed? In the task management GUI?
> Please note that the log not only reports domain as problematic but fails
> deactivate with CanDoAction:

Michael, will the domain still fail deactivate after your changes in this scenario?

> 2011-11-13 19:02:22,529 WARN
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
> (QuartzScheduler_Worker-24) domain 9fec96e9-4dec-465d-a4a4-fdc44b867154 was
> reported by all hosts in status UP as problematic. Moving the Domain to
> NonOperational.
> 2011-11-13 19:02:22,534 INFO
> [org.ovirt.engine.core.bll.storage.HandleFailedStorageDomainCommand]
> (QuartzScheduler_Worker-24) Running command:
> HandleFailedStorageDomainCommand internal: true. Entities affected : ID:
> 9fec96e9-4dec-465d-a4a4-fdc44b867154 Type: Storage
> 2011-11-13 19:02:22,553 WARN
> [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainCommand]
> (QuartzScheduler_Worker-24) CanDoAction of action DeactivateStorageDomain
> failed.
> Reasons:VAR__TYPE__STORAGE__DOMAIN,VAR__ACTION__DEACTIVATE,
> ERROR_CANNOT_DEACTIVATE_DOMAIN_WITH_TASKS
> 2011-11-13 19:02:22,555 ERROR
> [org.ovirt.engine.core.bll.storage.HandleFailedStorageDomainCommand]
> (QuartzScheduler_Worker-24) Transaction rolled-back for command:
> org.ovirt.engine.core.bll.storage.HandleFailedStorageDomainCommand.
> 2011-11-13 19:02:22,555 INFO
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (QuartzScheduler_Worker-24) No string for UNASSIGNED type. Use default Log
> 2011-11-13 19:02:26,780 WARN
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
> (QuartzScheduler_Worker-38) domain 9fec96e9-4dec-465d-a4a4-fdc44b867154 in
> problem. vds: blond-vdsg
There are two flows in the engine:
1. The running task - this will take a very long time to fail and will not cause any report to the user.
2. Domain monitoring - this is what wrote the error to the engine log.

The question is: in this scenario, can the user get any indication about an error in the GUI, seeing as the backend fails to deactivate the domain due to the running task? (An illustrative sketch of the intended behavior follows below.)
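For clarity, a small illustrative sketch (in Python, not the actual Java engine code) of what the second flow could do: when deactivation is blocked by running tasks, still emit a user-visible audit event instead of only the WARN line. All names here (raise_audit_event, try_deactivate, the event identifiers) are assumptions for illustration.

# Illustrative model only; the real fix lives in the Java engine
# (see the audit log discussion above). Names are hypothetical.

class DeactivationBlocked(Exception):
    """Raised when a domain cannot be deactivated because tasks are running."""

def raise_audit_event(event, domain_id, host):
    # Hypothetical stand-in for writing an audit-log entry that shows up
    # in the Admin GUI events tab.
    print("AUDIT: %s domain=%s host=%s" % (event, domain_id, host))

def try_deactivate(domain_id):
    # Hypothetical stand-in for DeactivateStorageDomain; in the scenario from
    # this bug it fails with ERROR_CANNOT_DEACTIVATE_DOMAIN_WITH_TASKS.
    raise DeactivationBlocked()

def on_domain_reported_problematic(domain_id, host):
    # Today only a WARN is written to engine.log. Proposed behavior: always
    # surface an audit event so the user sees the problem immediately, even
    # when the domain cannot be moved to NonOperational yet.
    raise_audit_event("STORAGE_DOMAIN_IN_PROBLEM", domain_id, host)
    try:
        try_deactivate(domain_id)
    except DeactivationBlocked:
        raise_audit_event("STORAGE_DOMAIN_DEACTIVATE_BLOCKED_BY_TASKS",
                          domain_id, host)

if __name__ == "__main__":
    on_domain_reported_problematic(
        "9fec96e9-4dec-465d-a4a4-fdc44b867154", "blond-vdsg")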
(In reply to comment #13)
> (In reply to comment #12)
> > http://gerrit.ovirt.org/6869 - a related fix
> >
> > tested again. It takes a lot of time for dd to realize that the NFS mount is
> > gone, but then the task will fail relatively soon (couple of minutes)
>
> but does the user have any indication that this has failed? in task
> management gui?

In the task GUI you get the status change with the regular polls, so it is not very quick, but it shows up with the next poll.

For me it seems the dd process (that vdsm started) should fail sooner, but it is just stuck for a very long time. I did not find a way to make dd fail sooner.
(In reply to comment #16)
> (In reply to comment #13)
> > (In reply to comment #12)
> > > http://gerrit.ovirt.org/6869 - a related fix
> > >
> > > tested again. It takes a lot of time for dd to realize that the NFS mount is
> > > gone, but then the task will fail relatively soon (couple of minutes)
> >
> > but does the user have any indication that this has failed? in task
> > management gui?
>
> In the task GUI you get the status change with the regular polls, therefore
> not very quick, but with the next poll.
>
> For me it seems the dd process (that vdsm started) should fail sooner, but
> it is just stuck for very long time. I did not find a way to make dd fail
> sooner.

There is no reason to fail dd; the storage problem could be transient and dd could recover. The only thing needed is to notify the user of the situation earlier, if possible.
http://gerrit.ovirt.org/#/c/7957/1
*** Bug 815858 has been marked as a duplicate of this bug. ***
verified on si24