Bug 1561522

Summary: Activated host is marked as connecting because of unreachable storage domain.
Product: Red Hat Enterprise Virtualization Manager Reporter: Roman Hodain <rhodain>
Component: ovirt-engine Assignee: Nobody <nobody>
Status: CLOSED DUPLICATE QA Contact: Elad <ebenahar>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.2.1 CC: gveitmic, lsurette, Rhev-m-bugs, rhodain, srevivo
Target Milestone: ovirt-4.3.0 Flags: lsvaty: testing_plan_complete-
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-12-19 00:47:20 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Roman Hodain 2018-03-28 13:46:21 UTC
Description of problem:
When a hypervisor is activated in a data center that has more than one NFS storage domain configured and the connection to the storage is broken, the host flips between "Up" and "Not Responding". This causes the WebAdmin to show the host as connecting for a long time. This is a problem for automation tasks. For example, activating the host via Ansible roles fails, and it is almost impossible to handle the error properly because the host can neither be activated nor put into maintenance mode.
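To illustrate the automation problem, here is a minimal client-side sketch of how a script could guard against the flapping state: it only reports success once the host has been "up" for several consecutive polls, and gives up after a timeout. The function name `wait_for_stable_up`, the `get_status` callable, the lowercase status strings, and all default values are assumptions made for illustration; they are not part of the oVirt SDK or the Ansible roles.

```python
import time


def wait_for_stable_up(get_status, timeout_s=300.0, poll_s=5.0, stable_polls=3,
                       clock=time.monotonic, sleep=time.sleep):
    """Return True once get_status() reports "up" stable_polls times in a row.

    Return False if that does not happen within timeout_s.
    get_status is a hypothetical callable returning the host status as a
    lowercase string; it is not an oVirt SDK or Ansible API.
    """
    deadline = clock() + timeout_s
    consecutive = 0
    while clock() < deadline:
        if get_status() == "up":
            consecutive += 1
            if consecutive >= stable_polls:
                return True
        else:
            # A flap back to another state resets the stability counter.
            consecutive = 0
        sleep(poll_s)
    return False
```

With a host that alternates between states on every poll, the counter never reaches `stable_polls`, so the caller gets a clean `False` after the timeout instead of acting on a transient "up".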

Version-Release number of selected component (if applicable):
rhvm-4.2.1.6-0.1.el7.noarch

How reproducible:
100%

Steps to Reproduce:
1) Deploy one host
2) Create two NFS SDs
3) Deactivate the host
4) Block the communication on the NFS share
    iptables -I INPUT -s ${Hypervisor_IP} -j REJECT
5) Activate the host

Actual results:
The host stays in the connecting state for a long time.

Expected results:
The host is marked as non-operational after a defined amount of time.

Additional info:
The engine repeatedly sends ConnectStorageServerVDSCommand, which returns VDSNetworkException because it takes a long time. The issue does not occur when only one storage domain is connected.

This is also happening on previous versions.

Comment 1 Yaniv Lavi 2018-08-13 08:52:20 UTC
The regular timeout for NFS mount is 70 seconds.
Is this the time the host is stuck in connecting?

Comment 2 Yaniv Lavi 2018-08-13 08:57:20 UTC
*** Bug 1580243 has been marked as a duplicate of this bug. ***

Comment 3 Germano Veit Michel 2018-08-14 00:12:23 UTC
(In reply to Yaniv Lavi from comment #1)
> The regular timeout for NFS mount is 70 seconds.
> Is this the time the host is stuck in connecting?

Hi Yaniv,

I think the other bug, which you set as duplicate, had a bit more info and some discussions already done.

Anyway, this is not just NFS and not only caused by the NFS mount timeout. ConnectStorageServerVDSCommand can take longer due to network (TCP), storage, or other delays. If this happens, the engine throws VDSNetworkException and tries again. And again, and again, in a loop. The host is always in the connecting -> not responding -> connecting dance.

The correct status would be Non-Operational, not the Connecting->NotResponding dance. Or as Nir suggested on the other bug, this could be async.
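The expected behaviour can be sketched as a retry loop with an overall deadline, where the host ends up Non-Operational instead of looping forever. This is only an illustration under stated assumptions, not the engine's actual code: `activate_host`, `HostStatus`, the `connect_storage` callable, and the default timings are hypothetical, and `TimeoutError` stands in for VDSNetworkException.

```python
import time
from enum import Enum


class HostStatus(Enum):
    UP = "Up"
    NON_OPERATIONAL = "NonOperational"


def activate_host(connect_storage, deadline_s=180.0, retry_delay_s=5.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Retry connect_storage() until it succeeds or the deadline passes.

    connect_storage is a hypothetical callable standing in for
    ConnectStorageServerVDSCommand; it raises TimeoutError where the
    engine would see VDSNetworkException.
    """
    start = clock()
    while True:
        try:
            connect_storage()
            return HostStatus.UP
        except TimeoutError:
            # Stop retrying once the overall deadline is exhausted and
            # report a definite state instead of flapping.
            if clock() - start >= deadline_s:
                return HostStatus.NON_OPERATIONAL
            sleep(retry_delay_s)
```

The point of the sketch is that the caller always receives one definite final status within a bounded time, which is what automation needs.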

Comment 5 Germano Veit Michel 2018-12-19 00:47:20 UTC

*** This bug has been marked as a duplicate of bug 1580243 ***