Bug 994973 - [vdsm] host cannot perform spmStart after vdsm crashed with running task
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.3.0
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.3.0
Assignee: Nir Soffer
QA Contact: Elad
URL:
Whiteboard: storage
Depends On:
Blocks:
Reported: 2013-08-08 10:29 UTC by Elad
Modified: 2016-02-10 18:29 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-11-28 13:03:21 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
nsoffer: needinfo-
amureini: Triaged+


Attachments
logs (5.09 MB, application/x-gzip)
2013-08-08 10:29 UTC, Elad
logs(2) (1.31 MB, application/x-gzip)
2013-09-12 08:59 UTC, Elad

Description Elad 2013-08-08 10:29:35 UTC
Created attachment 784304 [details]
logs

Description of problem:
spmStart fails (forever) after vdsm crashed while a task was running (formatStorageDomain in my case).


Version-Release number of selected component (if applicable):
vdsm-4.12.0-rc3.13.git06ed3cc.el6ev.x86_64

How reproducible:
unknown

Steps to Reproduce:
On a block pool with connected pool and ISO domain in maintenance:
- detach the ISO domain
- remove the domain (with format domain) and stop vdsm daemon right after
- start vdsm daemon

Actual results:
When vdsm comes up, it fails to perform spmStart with this error:

b59f7dd0-c434-4d1e-8840-3be7c112bd80::ERROR::2013-08-08 13:18:25,536::task::850::TaskManager.Task::(_setError) Task=`b59f7dd0-c434-4d1e-8840-3be7c112bd80`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 318, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/sp.py", line 293, in startSpm
    self.masterDomain.createMasterTree()
  File "/usr/share/vdsm/storage/sd.py", line 621, in createMasterTree
    self.oop.fileUtils.createdir(vmsDir)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 284, in callCrabRPCFunction
    *args, **kwargs)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 184, in callCrabRPCFunction
    rawLength = self._recvAll(LENGTH_STRUCT_LENGTH, timeout)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 153, in _recvAll
    timeLeft):
  File "/usr/lib64/python2.6/contextlib.py", line 83, in helper
    return GeneratorContextManager(func(*args, **kwds))
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 140, in _poll
    raise Timeout()
Timeout


The host cannot take SPM.

Expected results:
spmStart should succeed after vdsm comes up from crash

Additional info: logs
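
For readers following the traceback above: it ends in a Timeout raised while waiting for a reply inside remoteFileHandler.py. The snippet below is a minimal, standalone sketch of that general pattern (read a fixed number of bytes from a socket while polling against a deadline). It is not the vdsm code; `recv_all`, the `Timeout` class defined here, and the use of a plain socket are illustrative assumptions.

```python
import select
import socket
import time


class Timeout(Exception):
    """Raised when the peer does not answer before the deadline (hypothetical)."""


def recv_all(sock, length, timeout):
    """Read exactly `length` bytes from `sock`, or raise Timeout.

    Hypothetical helper; it only illustrates the poll-with-deadline idea.
    """
    deadline = time.monotonic() + timeout
    poller = select.poll()
    poller.register(sock.fileno(), select.POLLIN)
    chunks = []
    remaining = length
    while remaining > 0:
        time_left = deadline - time.monotonic()
        # poll() takes milliseconds and returns an empty list on timeout.
        if time_left <= 0 or not poller.poll(int(time_left * 1000)):
            raise Timeout()
        data = sock.recv(remaining)
        if not data:
            raise EOFError("peer closed the connection")
        chunks.append(data)
        remaining -= len(data)
    return b"".join(chunks)
```

If the other side never writes (for example because the helper process is stuck on storage I/O), the caller sees only Timeout and no detail about what the helper was doing.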

Comment 1 Ayal Baron 2013-09-01 14:11:31 UTC
formatStorageDomain is not an spm task and is unrelated.
The problem here is that a simple create dir operation failed.
Can you reproduce this?


Note that the problem first starts at:
7990b427-c081-4e12-81bd-f14dcbbc0ce7::DEBUG::2013-08-08 12:47:09,690::sd::620::Storage.StorageDomain::(createMasterTree) creating vms dir: /rhev/data-center/mnt/blockSD/afd63d97-e4f6-433c-87db-a5fea8a4c3b8/master/vms
7990b427-c081-4e12-81bd-f14dcbbc0ce7::ERROR::2013-08-08 12:47:09,740::sp::342::Storage.StoragePool::(startSpm) Unexpected error

Unfortunately, we don't have enough info here to know what the failure was.
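
To locate the first such failure in a large vdsm.log, something like the following could help. It is only a sketch: the regular expression is inferred from the two log lines quoted above, and `parse_line` and `first_error` are hypothetical helper names, not part of vdsm.

```python
import re
from datetime import datetime

# Layout inferred from the quoted lines:
# <thread>::<LEVEL>::<timestamp>::<module>::<lineno>::<logger>::(<func>) <message>
LINE_RE = re.compile(
    r"^(?P<thread>[^:]+)::(?P<level>[A-Z]+)::"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})::"
    r"(?P<module>[^:]+)::(?P<lineno>\d+)::(?P<logger>[^:]+)::"
    r"\((?P<func>[^)]*)\) (?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict for a log line in the assumed layout, else None."""
    m = LINE_RE.match(line.rstrip("\n"))
    if m is None:
        return None  # e.g. traceback continuation lines
    rec = m.groupdict()
    rec["lineno"] = int(rec["lineno"])
    rec["timestamp"] = datetime.strptime(rec.pop("ts"), "%Y-%m-%d %H:%M:%S,%f")
    return rec


def first_error(lines, func="startSpm"):
    """Return the first ERROR record logged from `func`, or None."""
    for line in lines:
        rec = parse_line(line)
        if rec and rec["level"] == "ERROR" and rec["func"] == func:
            return rec
    return None
```

Usage would be along the lines of `first_error(open(path))`, where `path` points at the extracted vdsm.log from the attachment.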

Comment 2 Aharon Canan 2013-09-12 08:08:20 UTC
We will try to reproduce and add new logs.

Comment 3 Elad 2013-09-12 08:58:22 UTC
Reproduced on IS13:

It seems to happen when the engine sends:

2013-09-12 11:37:08,169 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStorageServerVDSCommand] (ajp-/127.0.0.1:8702-2) START, DisconnectStorageServerVDSCommand(HostName = nott-vds2, HostId = 41266ecf-4268-45b0-b01d-d124020b9821, storagePoolId = 00000000-0000-0000-0000-000000000000, storageType = NFS, connectionList = [{ id: a4b5827a-9173-4edb-a545-56666ab4748e, connection: lion.qa.lab:/export/elad/elad1, iqn: null, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };]), log id: 62b394ee


and before the engine reports "FINISH, DisconnectStorageServerVDSCommand", it logs:

2013-09-12 11:37:09,151 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand] (DefaultQuartzScheduler_Worker-9) Command ListVDS execution failed. Exception: VDSNetworkException: java.net.SocketException: Connection reset


Logs will be attached.

Comment 4 Elad 2013-09-12 08:59:11 UTC
Created attachment 796706 [details]
logs(2)

Comment 5 Nir Soffer 2013-11-27 07:13:58 UTC
(In reply to Elad from comment #0)
> spmStart fails (forever) after vdsm crashed during running task (of
> formatStorageDomain in my case)

What do you mean by "vdsm crashed"?

> Steps to Reproduce:
> On a block pool with connected pool and ISO domain in maintenance:
> - detach the ISO domain
> - remove the domain (with format domain) and stop vdsm daemon right after
> - start vdsm daemon
> 
> Actual results:
> When vdsm comes up, it fails to perform spmStart with this error:
> 
> b59f7dd0-c434-4d1e-8840-3be7c112bd80::ERROR::2013-08-08
> 13:18:25,536::task::850::TaskManager.Task::(_setError)
> Task=`b59f7dd0-c434-4d1e-8840-3be7c112bd80`::Unexpected error
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/task.py", line 857, in _run
>     return fn(*args, **kargs)
>   File "/usr/share/vdsm/storage/task.py", line 318, in run
>     return self.cmd(*self.argslist, **self.argsdict)
>   File "/usr/share/vdsm/storage/sp.py", line 293, in startSpm
>     self.masterDomain.createMasterTree()
>   File "/usr/share/vdsm/storage/sd.py", line 621, in createMasterTree
>     self.oop.fileUtils.createdir(vmsDir)
>   File "/usr/share/vdsm/storage/remoteFileHandler.py", line 284, in
> callCrabRPCFunction
>     *args, **kwargs)
>   File "/usr/share/vdsm/storage/remoteFileHandler.py", line 184, in
> callCrabRPCFunction
>     rawLength = self._recvAll(LENGTH_STRUCT_LENGTH, timeout)
>   File "/usr/share/vdsm/storage/remoteFileHandler.py", line 153, in _recvAll
>     timeLeft):
>   File "/usr/lib64/python2.6/contextlib.py", line 83, in helper
>     return GeneratorContextManager(func(*args, **kwds))
>   File "/usr/share/vdsm/storage/remoteFileHandler.py", line 140, in _poll
>     raise Timeout()
> Timeout
> 

I cannot find this error, or any spmStart error, in logs(2). It looks like you could not reproduce this bug.

Comment 6 Nir Soffer 2013-11-28 13:03:21 UTC
Elad could not reproduce this with the latest version.

