Created attachment 784304 [details]
logs

Description of problem:
spmStart fails (forever) after vdsm crashed during running task (of formatStorageDomain in my case)

Version-Release number of selected component (if applicable):
vdsm-4.12.0-rc3.13.git06ed3cc.el6ev.x86_64

How reproducible:
unknown

Steps to Reproduce:
On a block pool with connected pool and ISO domain in maintenance:
- detach the ISO domain
- remove the domain (with format domain) and stop vdsm daemon right after
- start vdsm daemon

Actual results:
When vdsm comes up, it fails to perform spmStart with this error:

b59f7dd0-c434-4d1e-8840-3be7c112bd80::ERROR::2013-08-08 13:18:25,536::task::850::TaskManager.Task::(_setError) Task=`b59f7dd0-c434-4d1e-8840-3be7c112bd80`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 318, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/sp.py", line 293, in startSpm
    self.masterDomain.createMasterTree()
  File "/usr/share/vdsm/storage/sd.py", line 621, in createMasterTree
    self.oop.fileUtils.createdir(vmsDir)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 284, in callCrabRPCFunction
    *args, **kwargs)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 184, in callCrabRPCFunction
    rawLength = self._recvAll(LENGTH_STRUCT_LENGTH, timeout)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 153, in _recvAll
    timeLeft):
  File "/usr/lib64/python2.6/contextlib.py", line 83, in helper
    return GeneratorContextManager(func(*args, **kwds))
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 140, in _poll
    raise Timeout()
Timeout

Host cannot take SPM

Expected results:
spmStart should succeed after vdsm comes up from crash

Additional info:
logs
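For readers unfamiliar with the failure mode in the traceback above: the Timeout is raised while waiting for a length header from a helper process that never answers. The following is only a minimal Python 3 sketch of that general pattern (read an exact number of bytes from a file descriptor under a deadline). It is not vdsm's remoteFileHandler code; the names recv_all and Timeout here are illustrative assumptions.

```python
import os
import select
import time


class Timeout(Exception):
    """Raised when the peer does not answer before the deadline (hypothetical name)."""


def recv_all(fd, length, timeout):
    """Read exactly `length` bytes from `fd`, or raise Timeout after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    chunks = []
    remaining = length
    while remaining > 0:
        time_left = deadline - time.monotonic()
        if time_left <= 0:
            raise Timeout()
        # poll() takes milliseconds and returns an empty list when nothing is ready.
        if not poller.poll(max(1, int(time_left * 1000))):
            raise Timeout()
        data = os.read(fd, remaining)
        if not data:
            # An empty read means the writer closed its end.
            raise EOFError("peer closed the pipe")
        chunks.append(data)
        remaining -= len(data)
    return b"".join(chunks)
```

If the helper process is hung or dead but its pipe is still open, every call of this shape ends in Timeout, which would match the "fails forever" symptom, although the logs here do not confirm that cause.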
formatStorageDomain is not an SPM task and is unrelated.
The problem here is that a simple create dir operation failed.
Can you reproduce this?

Note that the problem first starts at:

7990b427-c081-4e12-81bd-f14dcbbc0ce7::DEBUG::2013-08-08 12:47:09,690::sd::620::Storage.StorageDomain::(createMasterTree) creating vms dir: /rhev/data-center/mnt/blockSD/afd63d97-e4f6-433c-87db-a5fea8a4c3b8/master/vms
7990b427-c081-4e12-81bd-f14dcbbc0ce7::ERROR::2013-08-08 12:47:09,740::sp::342::Storage.StoragePool::(startSpm) Unexpected error

Unfortunately, we don't have enough info here to know what the failure was.
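Locating the first failure in a long vdsm.log can be scripted. The sketch below assumes the "::"-separated layout visible in the lines quoted above (thread, level, timestamp, module, line number, logger, then "(function) message"). The helper names parse_line and first_error are hypothetical and are not part of vdsm.

```python
import re
from collections import namedtuple

LogRecord = namedtuple(
    "LogRecord",
    "thread level timestamp module lineno logger func message",
)

# The last field looks like "(startSpm) Unexpected error".
_FUNC_RE = re.compile(r"\((\w+)\)\s*(.*)", re.DOTALL)


def parse_line(line):
    """Parse one log line; return None for continuation lines such as tracebacks."""
    parts = line.rstrip("\n").split("::", 6)
    if len(parts) != 7:
        return None
    thread, level, timestamp, module, lineno, logger, rest = parts
    match = _FUNC_RE.match(rest)
    if match is None or not lineno.isdigit():
        return None
    return LogRecord(thread, level, timestamp, module, int(lineno),
                     logger, match.group(1), match.group(2))


def first_error(lines, func):
    """Return the first ERROR record logged from function `func`, or None."""
    for line in lines:
        record = parse_line(line)
        if record and record.level == "ERROR" and record.func == func:
            return record
    return None
```

Usage would be along the lines of `first_error(open("vdsm.log"), "startSpm")`, with the log path adjusted to wherever the attachment was extracted.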
We will try to reproduce and add new logs.
Reproduced on IS13. It seems to happen when the engine sends:

2013-09-12 11:37:08,169 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStorageServerVDSCommand] (ajp-/127.0.0.1:8702-2) START, DisconnectStorageServerVDSCommand(HostName = nott-vds2, HostId = 41266ecf-4268-45b0-b01d-d124020b9821, storagePoolId = 00000000-0000-0000-0000-000000000000, storageType = NFS, connectionList = [{ id: a4b5827a-9173-4edb-a545-56666ab4748e, connection: lion.qa.lab:/export/elad/elad1, iqn: null, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };]), log id: 62b394ee

Before the engine reports "FINISH, DisconnectStorageServerVDSCommand", it logs:

2013-09-12 11:37:09,151 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand] (DefaultQuartzScheduler_Worker-9) Command ListVDS execution failed. Exception: VDSNetworkException: java.net.SocketException: Connection reset

Logs will be attached.
Created attachment 796706 [details] logs(2)
(In reply to Elad from comment #0)
> spmStart fails (forever) after vdsm crashed during running task (of
> formatStorageDomain in my case)

What do you mean by "vdsm crashed"?

> Steps to Reproduce:
> On a block pool with connected pool and ISO domain in maintenance:
> - detach the ISO domain
> - remove the domain (with format domain) and stop vdsm daemon right after
> - start vdsm daemon
>
> Actual results:
> When vdsm comes up, it fails to perform spmStart with this error:
>
> b59f7dd0-c434-4d1e-8840-3be7c112bd80::ERROR::2013-08-08
> 13:18:25,536::task::850::TaskManager.Task::(_setError)
> Task=`b59f7dd0-c434-4d1e-8840-3be7c112bd80`::Unexpected error
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/task.py", line 857, in _run
>     return fn(*args, **kargs)
>   File "/usr/share/vdsm/storage/task.py", line 318, in run
>     return self.cmd(*self.argslist, **self.argsdict)
>   File "/usr/share/vdsm/storage/sp.py", line 293, in startSpm
>     self.masterDomain.createMasterTree()
>   File "/usr/share/vdsm/storage/sd.py", line 621, in createMasterTree
>     self.oop.fileUtils.createdir(vmsDir)
>   File "/usr/share/vdsm/storage/remoteFileHandler.py", line 284, in
> callCrabRPCFunction
>     *args, **kwargs)
>   File "/usr/share/vdsm/storage/remoteFileHandler.py", line 184, in
> callCrabRPCFunction
>     rawLength = self._recvAll(LENGTH_STRUCT_LENGTH, timeout)
>   File "/usr/share/vdsm/storage/remoteFileHandler.py", line 153, in _recvAll
>     timeLeft):
>   File "/usr/lib64/python2.6/contextlib.py", line 83, in helper
>     return GeneratorContextManager(func(*args, **kwds))
>   File "/usr/share/vdsm/storage/remoteFileHandler.py", line 140, in _poll
>     raise Timeout()
> Timeout
>

I cannot find such an error, or any spmStart error, in logs(2).
Looks like you cannot reproduce this bug.
Elad could not reproduce this with the latest version.