Bug 994973 - [vdsm] host cannot perform spmStart after vdsm crashed with running task
Status: CLOSED WORKSFORME
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.3.0
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.3.0
Assigned To: Nir Soffer
QA Contact: Elad
Whiteboard: storage
Keywords: Reopened
Depends On:
Blocks:
Reported: 2013-08-08 06:29 EDT by Elad
Modified: 2016-02-10 13:29 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-11-28 08:03:21 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
nsoffer: needinfo-
amureini: Triaged+


Attachments
logs (5.09 MB, application/x-gzip)
2013-08-08 06:29 EDT, Elad
logs(2) (1.31 MB, application/x-gzip)
2013-09-12 04:59 EDT, Elad

Description Elad 2013-08-08 06:29:35 EDT
Created attachment 784304
logs

Description of problem:
spmStart fails permanently after vdsm crashes while a task is running (formatStorageDomain in my case).


Version-Release number of selected component (if applicable):
vdsm-4.12.0-rc3.13.git06ed3cc.el6ev.x86_64

How reproducible:
unknown

Steps to Reproduce:
On a block pool with connected pool and ISO domain in maintenance:
- detach the ISO domain
- remove the domain (with format domain) and stop vdsm daemon right after
- start vdsm daemon

Actual results:
When vdsm comes up, it fails to perform spmStart with this error:

b59f7dd0-c434-4d1e-8840-3be7c112bd80::ERROR::2013-08-08 13:18:25,536::task::850::TaskManager.Task::(_setError) Task=`b59f7dd0-c434-4d1e-8840-3be7c112bd80`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 318, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/sp.py", line 293, in startSpm
    self.masterDomain.createMasterTree()
  File "/usr/share/vdsm/storage/sd.py", line 621, in createMasterTree
    self.oop.fileUtils.createdir(vmsDir)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 284, in callCrabRPCFunction
    *args, **kwargs)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 184, in callCrabRPCFunction
    rawLength = self._recvAll(LENGTH_STRUCT_LENGTH, timeout)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 153, in _recvAll
    timeLeft):
  File "/usr/lib64/python2.6/contextlib.py", line 83, in helper
    return GeneratorContextManager(func(*args, **kwds))
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 140, in _poll
    raise Timeout()
Timeout


Host cannot take SPM 
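
For readers unfamiliar with this kind of failure: the traceback ends in a poll that timed out while reading a length header from a helper process. The following is a minimal sketch of that general pattern (a length-prefixed read with a deadline), not the vdsm remoteFileHandler code. The names recv_all and recv_message and the 8-byte length prefix are assumptions made for illustration only.

```python
import select
import socket
import struct
import time

# Hypothetical framing: an 8-byte big-endian length, then the payload.
LENGTH_STRUCT = struct.Struct("!Q")


class Timeout(Exception):
    """Raised when the peer does not answer before the deadline."""


def recv_all(sock, length, timeout):
    """Read exactly `length` bytes from `sock`, or raise Timeout."""
    deadline = time.monotonic() + timeout
    chunks = []
    remaining = length
    while remaining > 0:
        time_left = deadline - time.monotonic()
        if time_left <= 0:
            raise Timeout()
        # Wait until data is available or the remaining time runs out.
        readable, _, _ = select.select([sock], [], [], time_left)
        if not readable:
            raise Timeout()
        chunk = sock.recv(remaining)
        if not chunk:
            raise EOFError("peer closed the connection")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_message(sock, timeout):
    """Read the length header first, then the payload it announces."""
    raw_length = recv_all(sock, LENGTH_STRUCT.size, timeout)
    (length,) = LENGTH_STRUCT.unpack(raw_length)
    return recv_all(sock, length, timeout)
```

In a design like this, a Timeout raised while reading the header only says that the helper never replied in time. It does not say why, so the underlying cause has to be found elsewhere in the logs.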

Expected results:
spmStart should succeed after vdsm comes up from crash

Additional info: logs
Comment 1 Ayal Baron 2013-09-01 10:11:31 EDT
formatStorageDomain is not an spm task and is unrelated.
The problem here is that a simple create dir operation failed.
Can you reproduce this?


Note that the problem first starts at:
7990b427-c081-4e12-81bd-f14dcbbc0ce7::DEBUG::2013-08-08 12:47:09,690::sd::620::Storage.StorageDomain::(createMasterTree) creating vms dir: /rhev/data-center/mnt/blockSD/afd63d97-e4f6-433c-87db-a5fea8a4c3b8/master/vms
7990b427-c081-4e12-81bd-f14dcbbc0ce7::ERROR::2013-08-08 12:47:09,740::sp::342::Storage.StoragePool::(startSpm) Unexpected error

And that unfortunately we don't have enough info here to know what the failure was.
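
To locate the first failure in a large vdsm.log, a small parser can help. This is a rough sketch that assumes every record follows the layout of the two lines quoted above (thread or task id, level, timestamp, module, line number, logger, function, message). The names parse_line and first_error are hypothetical, and continuation lines such as tracebacks are simply skipped.

```python
import re
from datetime import datetime

# Layout inferred from the log lines quoted in this bug; an assumption,
# not an official description of the vdsm log format.
LINE_RE = re.compile(
    r"^(?P<thread>[^:]+)::(?P<level>[A-Z]+)::"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})::"
    r"(?P<module>[^:]+)::(?P<lineno>\d+)::"
    r"(?P<logger>[^:]+)::\((?P<func>[^)]*)\) (?P<message>.*)$"
)


def parse_line(line):
    """Return a dict for a log record, or None for non-matching lines."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    record = match.groupdict()
    record["ts"] = datetime.strptime(record["ts"], "%Y-%m-%d %H:%M:%S,%f")
    record["lineno"] = int(record["lineno"])
    return record


def first_error(lines, func=None):
    """Return the earliest ERROR record, optionally limited to one function."""
    for line in lines:
        record = parse_line(line)
        if record is None or record["level"] != "ERROR":
            continue
        if func is None or record["func"] == func:
            return record
    return None
```

Calling first_error(open("vdsm.log"), func="startSpm") would return the earliest startSpm error, and the thread field of that record can then be used to collect the preceding lines of the same task.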
Comment 2 Aharon Canan 2013-09-12 04:08:20 EDT
We will try to reproduce and add new logs.
Comment 3 Elad 2013-09-12 04:58:22 EDT
Reproduced on IS13:

seems to happen when engine sends:

2013-09-12 11:37:08,169 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStorageServerVDSCommand] (ajp-/127.0.0.1:8702-2) START, DisconnectStorageServerVDSCommand(HostName = nott-vds2, HostId = 41266ecf-4268-45b0-b01d-d124020b9821, storagePoolId = 00000000-0000-0000-0000-000000000000, storageType = NFS, connectionList = [{ id: a4b5827a-9173-4edb-a545-56666ab4748e, connection: lion.qa.lab:/export/elad/elad1, iqn: null, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };]), log id: 62b394ee


and before the engine reports "FINISH, DisconnectStorageServerVDSCommand", it logs:

2013-09-12 11:37:09,151 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand] (DefaultQuartzScheduler_Worker-9) Command ListVDS execution failed. Exception: VDSNetworkException: java.net.SocketException: Connection reset


logs will be attached
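
The timing relation described above (an ERROR logged shortly after a START record) can be checked mechanically. This is a sketch only: it assumes engine.log records look like the two lines quoted in this comment, and the function name errors_after_start and the window parameter are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Layout inferred from the engine.log lines quoted in this comment.
ENGINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+)\s+\[(?P<cls>[^\]]+)\] "
    r"\((?P<thread>[^)]*)\) (?P<message>.*)$"
)


def errors_after_start(lines, command, window_seconds=5.0):
    """Yield (delay, message) for ERROR records logged within
    `window_seconds` after the latest "START, <command>" record."""
    window = timedelta(seconds=window_seconds)
    started = None
    for line in lines:
        match = ENGINE_RE.match(line.rstrip("\n"))
        if match is None:
            continue  # continuation line, e.g. part of a stack trace
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
        message = match.group("message")
        if message.startswith("START, " + command):
            started = ts
        elif match.group("level") == "ERROR" and started is not None:
            if ts - started <= window:
                yield ts - started, message
```

For example, errors_after_start(open("engine.log"), "DisconnectStorageServerVDSCommand") lists the errors that follow each disconnect request, which makes it easier to see whether the connection reset consistently follows that command.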
Comment 4 Elad 2013-09-12 04:59:11 EDT
Created attachment 796706
logs(2)
Comment 5 Nir Soffer 2013-11-27 02:13:58 EST
(In reply to Elad from comment #0)
> spmStart fails (forever) after vdsm crashed during running task (of
> formatStorageDomain in my case)

What do you mean by "vdsm crashed"?

> Steps to Reproduce:
> On a block pool with connected pool and ISO domain in maintenance:
> - detach the ISO domain
> - remove the domain (with format domain) and stop vdsm daemon right after
> - start vdsm daemon
> 
> Actual results:
> When vdsm comes up, it fails to perform spmStart with this error:
> 
> b59f7dd0-c434-4d1e-8840-3be7c112bd80::ERROR::2013-08-08
> 13:18:25,536::task::850::TaskManager.Task::(_setError)
> Task=`b59f7dd0-c434-4d1e-8840-3be7c112bd80`::Unexpected error
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/task.py", line 857, in _run
>     return fn(*args, **kargs)
>   File "/usr/share/vdsm/storage/task.py", line 318, in run
>     return self.cmd(*self.argslist, **self.argsdict)
>   File "/usr/share/vdsm/storage/sp.py", line 293, in startSpm
>     self.masterDomain.createMasterTree()
>   File "/usr/share/vdsm/storage/sd.py", line 621, in createMasterTree
>     self.oop.fileUtils.createdir(vmsDir)
>   File "/usr/share/vdsm/storage/remoteFileHandler.py", line 284, in
> callCrabRPCFunction
>     *args, **kwargs)
>   File "/usr/share/vdsm/storage/remoteFileHandler.py", line 184, in
> callCrabRPCFunction
>     rawLength = self._recvAll(LENGTH_STRUCT_LENGTH, timeout)
>   File "/usr/share/vdsm/storage/remoteFileHandler.py", line 153, in _recvAll
>     timeLeft):
>   File "/usr/lib64/python2.6/contextlib.py", line 83, in helper
>     return GeneratorContextManager(func(*args, **kwds))
>   File "/usr/share/vdsm/storage/remoteFileHandler.py", line 140, in _poll
>     raise Timeout()
> Timeout
> 

I cannot find such an error, or any spmStart error, in logs(2). It looks like you could not reproduce this bug.
Comment 6 Nir Soffer 2013-11-28 08:03:21 EST
Elad could not reproduce this with the latest version.