Bug 1310417 - [Non-Operational Host] attempt to activate goes through a slow state change before failing
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Infra
Version: 3.6.3.2
Hardware: x86_64
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Oved Ourfali
QA Contact: Pavel Stehlik
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-02-21 12:25 UTC by Michael Burman
Modified: 2016-12-07 07:22 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-12-07 07:22:33 UTC
oVirt Team: Network
Embargoed:



Description Michael Burman 2016-02-21 12:25:25 UTC
Description of problem:
[Non-Operational Host] - It is possible to activate a host that is Non-Operational because of a missing required network, although it shouldn't be possible.

For example, consider a host that moved to the Non-Operational state because an interface carrying a required network is down:
 
"Host orchid-vds1.qa.lab.tlv.redhat.com moved to Non-Operational state because interfaces which are down are needed by required networks in the current cluster: 'ens1f1 (req5)'."

"status='NonOperational', nonOperationalReason='NETWORK_INTERFACE_IS_DOWN'"

In such a case I shouldn't be able to activate the host, but I can.
On the next monitoring cycle the host is moved back to the Non-Operational state.
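
For reference, a rough way to confirm the state and the recorded reason (the engine FQDN and credentials below are placeholders, and the REST status element is laid out slightly differently in API v3 and v4):

# On the engine machine: the non-operational reason is recorded in engine.log
grep -i "NonOperational" /var/log/ovirt-engine/engine.log | tail -n 5

# The host status can also be read over the REST API
curl -s -k -u 'admin@internal:PASSWORD' -H 'Accept: application/xml' \
    'https://ENGINE_FQDN/ovirt-engine/api/hosts?search=name%3Dorchid-vds1*' \
    | grep -io 'non_operational'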

Version-Release number of selected component (if applicable):
3.6.3.2-0.1.el6

How reproducible:
100%

Steps to Reproduce:
1. Create a required network (required in the cluster) and attach it to a NIC on the host via Setup Networks
2. Set the interface the network is attached to down: ip link set <nic_name> down
3. Wait until the host moves to the Non-Operational state, then activate it (a rough shell sketch of steps 2-3 follows below)
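
A rough shell sketch of steps 2-3, using the NIC name from the example above (ens1f1) as a placeholder:

# On the host: take the NIC that carries the required network down
ip link set ens1f1 down
ip -br link show ens1f1        # should report DOWN

# On the engine machine: the next monitoring cycle records the state change
grep "Non-Operational" /var/log/ovirt-engine/engine.log | tail -n 1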

Actual results:
The host is set to Up, although it is missing a required network.

Expected results:
It shouldn't be possible to activate the host in this scenario.

Comment 1 Michael Burman 2016-02-22 13:36:04 UTC
Hi Dan,

Maybe this bug should be moved to Infra; I tested (with your help) two scenarios that are not related to Network.

For example: 

1) Reject traffic to/from the storage domain --> after a few minutes the host moves to Non-Operational. When pressing the 'Activate' button, the engine tries to activate the host and activation is initiated; the host then moves to the 'Connecting' state and after a few minutes fails with a VDSM command timeout (one way to create this condition is sketched below, after scenario 2).

2) Move a 3.5 host from a 3.5 cluster to a 3.6 cluster. When pressing the 'Activate' button, the host moves to the Up state and after a few seconds to the 'Connecting' state (the host is not compatible with the cluster's version), where it stays forever (there was no timeout; it was stuck in an endless 'Connecting' state).

All three of these scenarios (including the required-network one) should prevent any attempt to activate the host; there is an actual reason why the host is in a Non-Operational state.
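
For completeness, one way to create the storage-blocking condition of scenario 1 on the host (the storage server address is a placeholder; any equivalent firewall rule will do):

# Block traffic to/from the storage server
iptables -I INPUT  -s STORAGE_SERVER_IP -j REJECT
iptables -I OUTPUT -d STORAGE_SERVER_IP -j REJECT

# Undo when done
iptables -D INPUT  -s STORAGE_SERVER_IP -j REJECT
iptables -D OUTPUT -d STORAGE_SERVER_IP -j REJECT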

Thanks,

Comment 2 Sandro Bonazzola 2016-05-02 09:54:09 UTC
Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has already been released and the bug is not ON_QA.

Comment 3 Yaniv Lavi 2016-05-23 13:16:17 UTC
oVirt 4.0 beta has been released, moving to RC milestone.

Comment 4 Oved Ourfali 2016-12-05 19:25:20 UTC
Actually, by design you can do that.
The assumption is that you fixed the non-operational reason offline and now want to go ahead and try to activate the host.

If the issues were fixed, the host will eventually move to the Up status.

If not, it will go back to Non-Operational.
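
For what it's worth, this round trip (activate, briefly Up, back to Non-Operational on the next monitoring cycle) can be watched from outside with something like the sketch below. The engine URL, credentials and host id are placeholders, and the status element is laid out differently in REST API v3 and v4, hence the two grep patterns:

ENGINE='https://ENGINE_FQDN/ovirt-engine/api'   # placeholder
AUTH='admin@internal:PASSWORD'                  # placeholder
HOST_ID='HOST_UUID'                             # the host's id from GET $ENGINE/hosts

# Ask the engine to activate the Non-Operational host (empty <action/> body)
curl -s -k -u "$AUTH" -H 'Content-Type: application/xml' \
    -X POST -d '<action/>' "$ENGINE/hosts/$HOST_ID/activate"

# Poll the status; it flips to up, then back to non_operational
for i in $(seq 1 30); do
    curl -s -k -u "$AUTH" -H 'Accept: application/xml' "$ENGINE/hosts/$HOST_ID" \
        | grep -Eo '<status>[^<]*</status>|<state>[^<]*</state>'
    sleep 10
done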

Comment 5 Michael Burman 2016-12-06 11:56:03 UTC
Oved, our problem is that in the described scenario the host shouldn't go Up.
If a host is missing a required network, it shouldn't be Up.
But it does go Up, and that is the problem here; that is the whole point of a required network in the cluster.
The engine shouldn't allow the server to go Up when it is missing a required network.

In our case the issue wasn't fixed and the host still goes Up.
We do not agree with closing this report.

Comment 6 Gil Klein 2016-12-06 17:06:29 UTC
Reopening for reconsideration due to comment #5 and the impact on CI stability.

Comment 7 Oved Ourfali 2016-12-06 17:44:57 UTC
Can you explain the CI impact?
This has been the behavior forever, and changing the host lifecycle is very problematic.
We move the host to Up, communicate with it, and check the various network and storage statuses.
It has been this way for quite a few versions now, since 2.2.

Comment 8 Gil Klein 2016-12-06 17:49:59 UTC
(In reply to Oved Ourfali from comment #7)
> Can you explain the CI impact?
> This has been the behavior forever, and changing the host lifecycle is very
> problematic.
> We move the host to Up, communicate with it, and check the various network
> and storage statuses.
> It has been this way for quite a few versions now, since 2.2.
Meni, could you please describe the impact on the network testing in CI?

Comment 9 Meni Yakove 2016-12-06 17:52:23 UTC
In our Tier 1 tests we check 'required network' (the host should be Non-Operational if a required network is down), and from time to time the test fails because the host can be in the Up state.

Tier 1 runs on CI.

Anyway, this is not how it should be, even if it has been like that since 2.2. All status checks on the host should be done in the host's current state (Non-Operational), and only after that should the engine decide whether the host should go Up.

Comment 10 Oved Ourfali 2016-12-06 18:04:00 UTC
Please sit with me tomorrow to go over the test and let's see whether we can make it work somehow.

Whether it should behave one way or the other is negotiable, but changing it might have implications. I'll check it again anyway. Even if the change happens, it will not be anytime soon, so let's look at the test.

Comment 11 Oved Ourfali 2016-12-07 07:22:33 UTC
I looked at it again with Dan and Moti.
The issue is that the interface exists, but it is down.
When you activate a host, if the interface exists, the host doesn't move to Non-Operational.

Only after the host moves to Up do we start gathering statistics, and only then do we decide to move it to Non-Operational.

If, however, the interface doesn't exist at all, the host won't move to Up in the first place.

We agreed that fetching host statistics in the enforceNetworkCompliance method of HostNetworkTopologyPersisterImpl is risky.

Closing this as WONTFIX and moving it to Network; if you re-open it, they should consider doing the above (which I highly recommend not doing...).

