Description of problem:

Both while the customer was running 3.1 and after upgrading to 3.2, one of his hosts went non-operational, and the only solution seemed to be rebooting the host to clear a hung NFS task.

How reproducible:

Haven't attempted to reproduce in a test environment, but this happened to the customer multiple times, seemingly regardless of software version.

Steps to Reproduce:
1. Host is fully functional.
2. Host attempts to contact the storage domain.
3. remoteFileHandler.py times out and the host goes down.

Actual results:

Host goes to the non-operational state and has to be rebooted before it displays VM status correctly.

Expected results:

Seamless communication, no interruption of service.

Additional info:

Thread-62515::WARNING::2013-05-14 16:01:03,877::remoteFileHandler::185::Storage.CrabRPCProxy::(callCrabRPCFunction) Problem with handler, treating as timeout
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 177, in callCrabRPCFunction
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 143, in _recvAll
Timeout
Thread-62515::ERROR::2013-05-14 16:01:03,878::sdc::150::Storage.StorageDomainCache::(_findDomain) Error while looking for domain `a3ebdb74-5120-4406-92a5-da34ec2d3089`
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 145, in _findDomain
  File "/usr/share/vdsm/storage/nfsSD.py", line 155, in findDomain
  File "/usr/share/vdsm/storage/fileSD.py", line 130, in __init__
  File "/usr/share/vdsm/storage/persistentDict.py", line 85, in __getitem__
  File "/usr/share/vdsm/storage/persistentDict.py", line 193, in __getitem__
  File "/usr/lib64/python2.6/contextlib.py", line 16, in __enter__
  File "/usr/share/vdsm/storage/persistentDict.py", line 147, in _accessWrapper
  File "/usr/share/vdsm/storage/persistentDict.py", line 224, in refresh
  File "/usr/share/vdsm/storage/fileSD.py", line 99, in readlines
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 286, in callCrabRPCFunction
Timeout: Operation stuck on remote handler
Thread-62515::DEBUG::2013-05-14 16:01:03,880::resourceManager::538::ResourceManager::(releaseResource) Trying to release resource 'Storage.3664c1fc-469a-11e2-a679-077225634133'
Thread-62515::DEBUG::2013-05-14 16:01:03,880::resourceManager::553::ResourceManager::(releaseResource) Released resource 'Storage.3664c1fc-469a-11e2-a679-077225634133' (0 active users)
Thread-62515::DEBUG::2013-05-14 16:01:03,881::resourceManager::558::ResourceManager::(releaseResource) Resource 'Storage.3664c1fc-469a-11e2-a679-077225634133' is free, finding out if anyone is waiting for it.
Thread-62515::DEBUG::2013-05-14 16:01:03,881::resourceManager::565::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.3664c1fc-469a-11e2-a679-077225634133', Clearing records.
Thread-62515::ERROR::2013-05-14 16:01:03,882::task::853::TaskManager.Task::(_setError) Task=`5b1bd675-6ad0-457f-9842-e11deed3380a`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
  File "/usr/share/vdsm/storage/hsm.py", line 862, in connectStoragePool
  File "/usr/share/vdsm/storage/hsm.py", line 904, in _connectStoragePool
  File "/usr/share/vdsm/storage/sp.py", line 648, in connect
  File "/usr/share/vdsm/storage/sp.py", line 1178, in __rebuild
  File "/usr/share/vdsm/storage/sp.py", line 1522, in getMasterDomain
StoragePoolMasterNotFound: Cannot find master domain: 'spUUID=3664c1fc-469a-11e2-a679-077225634133, msdUUID=a3ebdb74-5120-4406-92a5-da34ec2d3089'
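For context, the tracebacks above (callCrabRPCFunction / _recvAll raising Timeout) are consistent with the file operation being delegated to a helper process with a deadline, so a hung NFS read surfaces as a timeout instead of blocking the caller. Below is a minimal sketch of that general pattern, not VDSM's actual implementation; the names read_with_timeout and _read_metadata, and the commented-out metadata path, are hypothetical.

import multiprocessing

class Timeout(Exception):
    pass

def _read_metadata(path, conn):
    # Runs in the helper process; a hung NFS mount can block here
    # indefinitely without affecting the parent.
    with open(path) as f:
        conn.send(f.readlines())

def read_with_timeout(path, timeout=60):
    parent, child = multiprocessing.Pipe(duplex=False)
    proc = multiprocessing.Process(target=_read_metadata, args=(path, child))
    proc.start()
    if not parent.poll(timeout):        # wait up to `timeout` seconds for data
        proc.terminate()                # give up on the stuck helper
        raise Timeout("Operation stuck on remote handler")
    result = parent.recv()
    proc.join()
    return result

# Hypothetical path; substitute the real mount and domain UUID:
# lines = read_with_timeout("/rhev/data-center/mnt/<server>:<export>/<sdUUID>/dom_md/metadata")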
Please attach an SOS report.
Created attachment 767917 [details] sosreport 6/14
Everything seems to be working as expected. If the host has problems communicating with the NFS server, things will obviously start to break. We need to figure out why the NFS server is not responding and fix that. As long as VDSM can't read the metadata of the storage, it can't connect to the pool.
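One way to check this from the host is to probe the mount with a hard deadline, so a hung NFS mount doesn't wedge the check itself. A minimal sketch, assuming the domain is mounted under /rhev/data-center/mnt; the placeholder mount path is illustrative and must be replaced with the real one from `mount` output:

import multiprocessing
import os
import sys

# Hypothetical mount point; substitute the real NFS mount.
MOUNT = "/rhev/data-center/mnt/<nfs-server>:<export>"

def probe(path):
    os.statvfs(path)   # blocks indefinitely if the NFS server is unresponsive

if __name__ == "__main__":
    p = multiprocessing.Process(target=probe, args=(MOUNT,))
    p.start()
    p.join(10)                        # allow 10 seconds for the stat
    if p.is_alive():
        p.terminate()                 # may linger in D state; that itself is a clue
        sys.exit("NFS mount appears hung: statvfs did not return in 10s")
    print("NFS mount responded normally")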
Just to be clear: the host attempts to connect to a storage pool, but since it cannot read the master domain (the read timed out), it fails to connect to the pool. At that point the engine moves it to non-operational (this is standard), as this host is required to be connected to this pool. So this happens due to the host's inability to read the master domain. In the standard flow, the host recovery mechanism kicks in every 5 minutes and basically retries the connect to the storage pool (see the sketch after this comment).

So a few questions:
- Does it happen to different hosts? (looks like yes)
- Does it always happen on the same NFS server?
- Did you try to access the NFS server manually from the host after it happened?
- Did you try to restart VDSM instead of rebooting? If that fixes the problem, it may indicate a VDSM problem.
- How long did you wait before the reboot (to see whether the host recovery kicked in and the assumed NFS problem went away)?

Basically it looks like a connectivity problem to the NFS server, and that can happen for many reasons.
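For illustration only, the recovery behavior described above amounts to a periodic retry loop along these lines; try_connect is a hypothetical stand-in for the engine asking VDSM to redo connectStoragePool, not engine code:

import time

RETRY_INTERVAL = 300  # the ~5-minute recovery cadence mentioned above

def try_connect(pool_uuid):
    # Hypothetical stand-in for the engine's connectStoragePool request.
    raise NotImplementedError

def recover(pool_uuid):
    while True:
        try:
            try_connect(pool_uuid)
            print("reconnected to pool %s" % pool_uuid)
            return
        except Exception as err:      # e.g. a Timeout or StoragePoolMasterNotFound
            print("connect failed (%s); retrying in %ds" % (err, RETRY_INTERVAL))
            time.sleep(RETRY_INTERVAL)

The point is that if the NFS problem clears on its own, the host should recover without a reboot; if it never does, the mount (or the path to the server) is the thing to investigate.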