Bug 979553 - remoteFileHandler.py timeout causes host to go non-operational
Summary: remoteFileHandler.py timeout causes host to go non-operational
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.3.0
Assignee: Saggi Mizrahi
QA Contact: Barak Dagan
URL:
Whiteboard: infra
Depends On:
Blocks:
 
Reported: 2013-06-28 19:56 UTC by wdaniel
Modified: 2018-12-02 17:33 UTC
CC List: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-09-12 10:15:08 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments
sosreport 6/14 (17.86 MB, text/plain)
2013-07-02 18:47 UTC, wdaniel

Description wdaniel 2013-06-28 19:56:27 UTC
Description of problem:

While running both 3.1 and 3.2, the customer had one of his hosts go non-operational, and the only apparent solution was to restart the host to clear a hung NFS task.


How reproducible:

Haven't attempted to reproduce this in a test environment, but it happened to the customer multiple times, seemingly regardless of software version.


Steps to Reproduce:
1. Host is fully functional
2. Host attempts to contact storage domain
3. remoteFileHandler.py times out, and the host goes down

Actual results:
Host goes into a non-operational state and has to be rebooted before it reports VM status correctly.

Expected results:
Seamless communication, no interruption of service


Additional info:

Thread-62515::WARNING::2013-05-14 16:01:03,877::remoteFileHandler::185::Storage.CrabRPCProxy::(callCrabRPCFunction) Problem with handler, treating as timeout
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 177, in callCrabRPCFunction
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 143, in _recvAll
Timeout
Thread-62515::ERROR::2013-05-14 16:01:03,878::sdc::150::Storage.StorageDomainCache::(_findDomain) Error while looking for domain `a3ebdb74-5120-4406-92a5-da34ec2d3089`
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 145, in _findDomain
  File "/usr/share/vdsm/storage/nfsSD.py", line 155, in findDomain
  File "/usr/share/vdsm/storage/fileSD.py", line 130, in __init__
  File "/usr/share/vdsm/storage/persistentDict.py", line 85, in __getitem__
  File "/usr/share/vdsm/storage/persistentDict.py", line 193, in __getitem__
  File "/usr/lib64/python2.6/contextlib.py", line 16, in __enter__
  File "/usr/share/vdsm/storage/persistentDict.py", line 147, in _accessWrapper
  File "/usr/share/vdsm/storage/persistentDict.py", line 224, in refresh
  File "/usr/share/vdsm/storage/fileSD.py", line 99, in readlines
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 286, in callCrabRPCFunction
Timeout: Operation stuck on remote handler
Thread-62515::DEBUG::2013-05-14 16:01:03,880::resourceManager::538::ResourceManager::(releaseResource) Trying to release resource 'Storage.3664c1fc-469a-11e2-a679-077225634133'
Thread-62515::DEBUG::2013-05-14 16:01:03,880::resourceManager::553::ResourceManager::(releaseResource) Released resource 'Storage.3664c1fc-469a-11e2-a679-077225634133' (0 active users)
Thread-62515::DEBUG::2013-05-14 16:01:03,881::resourceManager::558::ResourceManager::(releaseResource) Resource 'Storage.3664c1fc-469a-11e2-a679-077225634133' is free, finding out if anyone is waiting for it.
Thread-62515::DEBUG::2013-05-14 16:01:03,881::resourceManager::565::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.3664c1fc-469a-11e2-a679-077225634133', Clearing records.
Thread-62515::ERROR::2013-05-14 16:01:03,882::task::853::TaskManager.Task::(_setError) Task=`5b1bd675-6ad0-457f-9842-e11deed3380a`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
  File "/usr/share/vdsm/storage/hsm.py", line 862, in connectStoragePool
  File "/usr/share/vdsm/storage/hsm.py", line 904, in _connectStoragePool
  File "/usr/share/vdsm/storage/sp.py", line 648, in connect
  File "/usr/share/vdsm/storage/sp.py", line 1178, in __rebuild
  File "/usr/share/vdsm/storage/sp.py", line 1522, in getMasterDomain
StoragePoolMasterNotFound: Cannot find master domain: 'spUUID=3664c1fc-469a-11e2-a679-077225634133, msdUUID=a3ebdb74-5120-4406-92a5-da34ec2d3089'
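
For context on the Timeout above: vdsm performs file I/O against NFS domains in separate helper processes, precisely so that a read stuck on a hung mount cannot block the main process forever; "Operation stuck on remote handler" is what surfaces when a helper fails to answer within its deadline. A minimal sketch of that pattern, with illustrative names and an assumed 60-second default (this is not the actual remoteFileHandler code):

    import multiprocessing

    def _read_lines(path, conn):
        # Runs in the helper process: a read from a hung, hard-mounted
        # NFS export can block here indefinitely without freezing the
        # parent process.
        with open(path) as f:
            conn.send(f.readlines())

    def read_lines_with_timeout(path, timeout=60):
        # Illustrative only: vdsm's remoteFileHandler keeps a pool of
        # long-lived helper processes and its own RPC framing
        # ("CrabRPC"), but the timeout semantics are the same.
        parent, child = multiprocessing.Pipe()
        proc = multiprocessing.Process(target=_read_lines,
                                       args=(path, child))
        proc.start()
        if not parent.poll(timeout):   # wait up to `timeout` seconds
            proc.terminate()           # give up on the stuck helper
            raise RuntimeError("Operation stuck on remote handler")
        lines = parent.recv()
        proc.join()
        return lines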

Comment 1 Haim 2013-06-30 04:53:21 UTC
Please attach an SOS report.

Comment 2 wdaniel 2013-07-02 18:47:12 UTC
Created attachment 767917 [details]
sosreport 6/14

Comment 3 Saggi Mizrahi 2013-07-21 12:02:52 UTC
Everything seems to be working as expected. If the host has problems communicating with the NFS server, things will obviously start to break.

We need to figure out why the NFS server is not responding and fix that.
As long as VDSM can't read the metadata of the storage, it can't connect
to the pool.
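
To make the chain in the traceback concrete: connectStoragePool needs the master domain, and finding the master domain needs a metadata read through the remote handler, so one stuck NFS read escalates into a pool-level failure. A simplified illustration, with hypothetical names (not vdsm's actual code):

    class Timeout(Exception):
        pass

    class StoragePoolMasterNotFound(Exception):
        pass

    def connect_storage_pool(sp_uuid, msd_uuid, read_metadata):
        # `read_metadata` stands in for the remote-handler call that
        # timed out in the log; any Timeout it raises is reported as a
        # missing master domain, which in turn is why the engine marks
        # the host non-operational.
        try:
            return read_metadata(msd_uuid)
        except Timeout:
            raise StoragePoolMasterNotFound(
                "Cannot find master domain: spUUID=%s, msdUUID=%s"
                % (sp_uuid, msd_uuid))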

Comment 4 Barak 2013-07-21 14:03:56 UTC
Just to be clear,

The host attempts to connect to a storagePool, but since it cannot read the master domain (the read timed out), it fails to connect to the pool.

At this point the engine moves it to non-operational (this is standard behavior), as this host is required to be connected to this pool.

So this happens due to the host's inability to read the master domain.
In the standard flow, the host recovery mechanism kicks in every 5 minutes and basically retries the connect to the storagePool.

So a few questions:
- does it happen to different hosts? (looks like yes)
- does it always happen with the same NFS server?
- did you try to access the NFS server manually from the host after it happened?
- did you try restarting VDSM instead of rebooting? If that fixes the
  problem, it may indicate a VDSM problem
- how much time did you wait before the reboot (to see whether host recovery
  kicked in and the assumed NFS problem went away)?
 
Basically it looks like a connectivity problem to the NFS server, and this can happen for many reasons.
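
On the "access the NFS server manually" question: note that a plain ls against a hung, hard-mounted export can itself block forever, so it is safer to probe from a child process that can be abandoned. A rough sketch of such a check (illustrative helper, not part of vdsm):

    import subprocess
    import time

    def probe_mount(path, timeout=10):
        # List `path` in a child process; if the mount is hung, the
        # child blocks, and after `timeout` seconds we send it SIGKILL
        # and stop waiting instead of hanging the shell.
        proc = subprocess.Popen(["ls", path],
                                stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
        deadline = time.time() + timeout
        while proc.poll() is None:        # still running?
            if time.time() > deadline:
                proc.kill()               # abandon the stuck child
                return False
            time.sleep(0.5)
        return proc.returncode == 0       # True iff ls finished in time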

