Bug 753591

Summary: Backend: problem in export domain will not be reported to user
Product: Red Hat Enterprise Virtualization Manager
Component: ovirt-engine
Version: unspecified
Hardware: x86_64
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: medium
Target Release: 3.1.0
Fixed In Version: SI24
Whiteboard: storage
oVirt Team: Storage
Reporter: Dafna Ron <dron>
Assignee: Laszlo Hornyak <lhornyak>
QA Contact: Dafna Ron <dron>
CC: abaron, acathrow, amureini, dfediuck, iheim, knesenko, lpeer, mkenneth, ofrenkel, Rhev-m-bugs, yeylon, ykaul, yzaslavs
Doc Type: Bug Fix
Doc Text: If exporting a VM fails due to a storage problem on the export domain, a failed task will be shown in the webadmin.
Last Closed: 2012-12-04 20:01:38 UTC
Attachments: logs

Description Dafna Ron 2011-11-13 17:10:58 UTC
Created attachment 533387 [details]
logs

Description of problem:

When connectivity to the export domain is blocked during a VM export, the problem is detected by vdsm and reported to the backend, but it is not reported to the user in the Admin GUI.

2011-11-13 18:52:18,229 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-62) domain 9fec96e9-4dec-465d-a4a4-fdc44b867154 in problem. vds: blond-vdsg

Since the dd tasks currently continue to run in vdsm and do not fail, and the backend does not report anything to the user, the user will not know that there is any problem until the tasks fail after 12 hours.

Version-Release number of selected component (if applicable):

ic148

How reproducible:

100%

Steps to Reproduce:
1. Start an export of a VM with force override (so that vdsm will use dd and not qemu-convert)
2. Once the VM has started copying, block connectivity to the export domain

  
Actual results:

The backend log will show that the domain is in problem, but no alert will be reported in the Admin GUI (for a data domain this is usually an audit log error).

Expected results:

We should report the problem with the domain to the user.

Additional info: logs

Please note that the log not only reports the domain as problematic, but also shows deactivation failing on CanDoAction:

2011-11-13 19:02:22,529 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-24) domain 9fec96e9-4dec-465d-a4a4-fdc44b867154 was reported by all hosts in status UP as problematic. Moving the Domain to NonOperational.
2011-11-13 19:02:22,534 INFO  [org.ovirt.engine.core.bll.storage.HandleFailedStorageDomainCommand] (QuartzScheduler_Worker-24) Running command: HandleFailedStorageDomainCommand internal: true. Entities affected :  ID: 9fec96e9-4dec-465d-a4a4-fdc44b867154 Type: Storage
2011-11-13 19:02:22,553 WARN  [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainCommand] (QuartzScheduler_Worker-24) CanDoAction of action DeactivateStorageDomain failed. Reasons:VAR__TYPE__STORAGE__DOMAIN,VAR__ACTION__DEACTIVATE,ERROR_CANNOT_DEACTIVATE_DOMAIN_WITH_TASKS
2011-11-13 19:02:22,555 ERROR [org.ovirt.engine.core.bll.storage.HandleFailedStorageDomainCommand] (QuartzScheduler_Worker-24) Transaction rolled-back for command: org.ovirt.engine.core.bll.storage.HandleFailedStorageDomainCommand.
2011-11-13 19:02:22,555 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-24) No string for UNASSIGNED type. Use default Log
2011-11-13 19:02:26,780 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-38) domain 9fec96e9-4dec-465d-a4a4-fdc44b867154 in problem. vds: blond-vdsg

Comment 1 Laszlo Hornyak 2011-11-18 15:24:37 UTC
And also the VM is stuck in 'Image Locked' state?

Comment 2 Laszlo Hornyak 2011-11-18 15:35:12 UTC
OK, after a timeout it goes back to normal, but if the operation fails it would be nice to return it to the original state immediately.

Comment 3 Laszlo Hornyak 2011-11-18 17:11:21 UTC
When the image gets unlocked, the error message appears in the Events tab: Failed to export VM ... to ....

This happens quite a long while after I have killed NFS on the storage server, some 10+ minutes.

Comment 4 Dafna Ron 2011-11-20 08:35:56 UTC
(In reply to comment #1)
> And also the VM is stuck in 'Image Locked' state?

The VM will remain in Image Locked because the tasks are still running on the vds side.
Once the tasks fail on the vds side, you should get an export failed error.

Comment 5 Laszlo Hornyak 2011-11-21 09:09:14 UTC
Dafna: so what you want is to get this error message (and unlock the VM) as soon as vdsm fails and not wait the 10+ minutes until the task fails?

Comment 6 Laszlo Hornyak 2011-11-24 08:15:41 UTC
iscsi storage only!

Comment 7 Laszlo Hornyak 2011-11-24 14:18:56 UTC
(In reply to comment #6)
> iscsi storage only!

unrelated in any way to iscsi :) simply depends on the task timeout

Comment 8 lpeer 2012-01-30 08:13:30 UTC
Moving to upstream as an easy fix; we need to add an audit log entry in addition to the WARN that is written to the engine log.
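
For illustration, a minimal self-contained sketch of the idea (this is not the actual engine code; the AuditLog interface below is a hypothetical stand-in for the engine's AuditLogDirector):

import java.util.logging.Logger;

// Illustrative sketch only -- not the actual ovirt-engine fix. It shows the
// shape of the proposal: besides the WARN that already goes to engine.log,
// also record a user-visible audit event.
public class DomainMonitorSketch {

    interface AuditLog {                        // hypothetical stand-in for AuditLogDirector
        void userVisibleEvent(String message);
    }

    private static final Logger LOG = Logger.getLogger(DomainMonitorSketch.class.getName());
    private final AuditLog auditLog;

    DomainMonitorSketch(AuditLog auditLog) {
        this.auditLog = auditLog;
    }

    void onDomainInProblem(String domainId, String vdsName) {
        // Current behavior: only a WARN in the engine log.
        LOG.warning("domain " + domainId + " in problem. vds: " + vdsName);

        // Proposed addition: an audit-log event, so the problem shows up in
        // the webadmin Events tab instead of only in engine.log.
        auditLog.userVisibleEvent("Storage domain " + domainId
                + " is experiencing problems (reported by host " + vdsName + ")");
    }
}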

Comment 10 Laszlo Hornyak 2012-07-31 15:28:42 UTC
Isn't this a problem in vdsm? It looks like the task gets stuck in vdsm for a very long time even after the storage has crashed.

TASK: 62935675-7ceb-4035-b07c-c2c09d1460da STATUS: running RESULT:  MESSAGE: 'Task is initializing'
{'status': {'message': 'OK', 'code': 0}, 'taskStatus': {'message': 'Task is initializing', 'code': 0, 'taskID': '62935675-7ceb-4035-b07c-c2c09d1460da', 'taskResult': '', 'taskState': 'running'}}

So from this the engine cannot tell whether the task is actually making progress or is stuck.
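
In other words, the status reply looks identical for a healthy task and a stuck one; the only generic signal a caller could use is how long the task keeps reporting the exact same status. A small illustrative sketch of such a staleness check (hypothetical, not engine code):

import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only -- not ovirt-engine code. A healthy task and a
// stuck one both report taskState 'running' / 'Task is initializing', so the
// caller has to track how long the same status has been reported.
public class TaskStalenessSketch {

    private static final class Seen {
        final String status;
        final Instant since;
        Seen(String status, Instant since) { this.status = status; this.since = since; }
    }

    private final Map<String, Seen> lastSeen = new HashMap<>();
    private final Duration staleAfter;

    public TaskStalenessSketch(Duration staleAfter) {
        this.staleAfter = staleAfter;
    }

    /** True if the task has reported the exact same status for longer than staleAfter. */
    public boolean looksStuck(String taskId, String taskState, String taskMessage) {
        String status = taskState + "/" + taskMessage;
        Instant now = Instant.now();
        Seen prev = lastSeen.get(taskId);
        if (prev == null || !prev.status.equals(status)) {
            lastSeen.put(taskId, new Seen(status, now));
            return false;
        }
        return Duration.between(prev.since, now).compareTo(staleAfter) > 0;
    }
}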

Comment 11 Laszlo Hornyak 2012-08-02 17:10:15 UTC
This is the exception that VDSM logs when the task finally times out:

Thread-133::WARNING::2012-08-02 19:02:21,991::remoteFileHandler::183::Storage.CrabRPCProxy::(callCrabRPCFunction) Problem with handler, treating as timeout
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 175, in callCrabRPCFunction
    rawLength = self._recvAll(LENGTH_STRUCT_LENGTH, timeout)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 141, in _recvAll
    raise Timeout()
Timeout
Thread-133::ERROR::2012-08-02 19:02:21,992::task::833::TaskManager.Task::(_setError) Task=`31700a1e-6a5e-4063-bec5-3ddbaeb726b7`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 840, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1091, in updateVM
    self.validateSdUUID(sdUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 203, in validateSdUUID
    sdDom.validate()
  File "/usr/share/vdsm/storage/fileSD.py", line 262, in validate
    self.getMetadata()
  File "/usr/share/vdsm/storage/sd.py", line 688, in getMetadata
    return self._metadata.copy()
  File "/usr/share/vdsm/storage/persistentDict.py", line 129, in copy
    md = self._dict.copy()
  File "/usr/share/vdsm/storage/persistentDict.py", line 318, in copy
    with self._accessWrapper():
  File "/usr/lib64/python2.6/contextlib.py", line 16, in __enter__
    return self.gen.next()
  File "/usr/share/vdsm/storage/persistentDict.py", line 152, in _accessWrapper
    self.refresh()
  File "/usr/share/vdsm/storage/persistentDict.py", line 230, in refresh
    lines = self._metaRW.readlines()
  File "/usr/share/vdsm/storage/fileSD.py", line 82, in readlines
    return misc.stripNewLines(self._oop.directReadLines(self._metafile))
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 280, in callCrabRPCFunction
    raise Timeout("Operation stuck on remote handler")
Timeout: Operation stuck on remote handler

Comment 12 Laszlo Hornyak 2012-08-03 15:41:45 UTC
http://gerrit.ovirt.org/6869 - a related fix

tested again. It takes a lot of time for dd to realize that the NFS mount is gone, but then the task will fail relatively soon (couple of minutes)

Comment 13 Ayal Baron 2012-08-08 08:58:19 UTC
(In reply to comment #12)
> http://gerrit.ovirt.org/6869 - a related fix
> 
> tested again. It takes a lot of time for dd to realize that the NFS mount is
> gone, but then the task will fail relatively soon (couple of minutes)

but does the user have any indication that this has failed? in task management gui?

Comment 14 Ayal Baron 2012-09-02 07:57:23 UTC
> 
> Please note that the log not only reports domain as problematic but fails
> deactivate with CanDoAction:

Michael, will the domain still fail deactivate after your changes in this scenario?

> 
> 2011-11-13 19:02:22,529 WARN 
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
> (QuartzScheduler_Worker-24) domain 9fec96e9-4dec-465d-a4a4-fdc44b867154 was
> reported by all hosts in status UP as problematic. Moving the Domain to
> NonOperational.
> 2011-11-13 19:02:22,534 INFO 
> [org.ovirt.engine.core.bll.storage.HandleFailedStorageDomainCommand]
> (QuartzScheduler_Worker-24) Running command:
> HandleFailedStorageDomainCommand internal: true. Entities affected :  ID:
> 9fec96e9-4dec-465d-a4a4-fdc44b867154 Type: Storage
> 2011-11-13 19:02:22,553 WARN 
> [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainCommand]
> (QuartzScheduler_Worker-24) CanDoAction of action DeactivateStorageDomain
> failed.
> Reasons:VAR__TYPE__STORAGE__DOMAIN,VAR__ACTION__DEACTIVATE,
> ERROR_CANNOT_DEACTIVATE_DOMAIN_WITH_TASKS
> 2011-11-13 19:02:22,555 ERROR
> [org.ovirt.engine.core.bll.storage.HandleFailedStorageDomainCommand]
> (QuartzScheduler_Worker-24) Transaction rolled-back for command:
> org.ovirt.engine.core.bll.storage.HandleFailedStorageDomainCommand.
> 2011-11-13 19:02:22,555 INFO 
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (QuartzScheduler_Worker-24) No string for UNASSIGNED type. Use default Log
> 2011-11-13 19:02:26,780 WARN 
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
> (QuartzScheduler_Worker-38) domain 9fec96e9-4dec-465d-a4a4-fdc44b867154 in
> problem. vds: blond-vdsg

Comment 15 Ayal Baron 2012-09-03 11:13:15 UTC
There are 2 flows in the engine:
1. The running task - this will take a very long time to fail and will not cause any report to the user.
2. Domain monitoring - this is what wrote the error to the engine log.
The question is whether, in this scenario, the user can get any indication of an error in the GUI (seeing as the backend fails to deactivate the domain due to the running task).

Comment 16 Laszlo Hornyak 2012-09-03 11:55:48 UTC
(In reply to comment #13)
> (In reply to comment #12)
> > http://gerrit.ovirt.org/6869 - a related fix
> > 
> > tested again. It takes a lot of time for dd to realize that the NFS mount is
> > gone, but then the task will fail relatively soon (couple of minutes)
> 
> but does the user have any indication that this has failed? in task
> management gui?

In the task GUI you get the status change with the regular polls, so not immediately, but with the next poll.

It seems to me the dd process (that vdsm started) should fail sooner, but it just stays stuck for a very long time. I did not find a way to make dd fail sooner.

Comment 17 Ayal Baron 2012-09-12 07:58:50 UTC
(In reply to comment #16)
> (In reply to comment #13)
> > (In reply to comment #12)
> > > http://gerrit.ovirt.org/6869 - a related fix
> > > 
> > > tested again. It takes a lot of time for dd to realize that the NFS mount is
> > > gone, but then the task will fail relatively soon (couple of minutes)
> > 
> > but does the user have any indication that this has failed? in task
> > management gui?
> 
> In the task GUI you get the status change with the regular polls, therefore
> not very quick, but with the next poll.
> 
> For me it seems the dd process (that vdsm started) should fail sooner, but
> it is just stuck for very long time. I did not find a way to make dd fail
> sooner.

There is no reason to fail dd; the storage problem could be transient and dd could recover.

The only thing that is needed is to notify the user of the situation earlier, if possible.

Comment 18 Laszlo Hornyak 2012-09-12 14:17:51 UTC
http://gerrit.ovirt.org/#/c/7957/1

Comment 20 Ayal Baron 2012-10-16 14:31:30 UTC
*** Bug 815858 has been marked as a duplicate of this bug. ***

Comment 23 Dafna Ron 2012-11-05 15:59:26 UTC
Verified on SI24.