Red Hat Bugzilla – Bug 1310417
[Non-Operational Host] attempt to activate goes through a slow state change before failing
Last modified: 2016-12-07 02:22:33 EST
Description of problem:
[Non-Operational Host] - It is possible to activate a non-operational host (non-operational because of a missing required network on the host), although it shouldn't be possible.
For example, the host moves to the Non-Operational state because an interface carrying a required network is down:
"Host orchid-vds1.qa.lab.tlv.redhat.com moved to Non-Operational state because interfaces which are down are needed by required networks in the current cluster: 'ens1f1 (req5)'."
In such a case I shouldn't be able to activate the host, but I can.
On the next monitoring cycle the host is moved back to the Non-Operational state.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Create a required network (required in the cluster) and attach it to a NIC on the host via Setup Networks
2. Bring down the interface to which the network is attached: ip link set <nic_name> down
3. Wait until the host moves to the Non-Operational state, then activate it
The host is set to UP, although it is missing a required network.
It shouldn't be possible to activate a host in this scenario.
Maybe this bug should be moved to Infra; I tested (with your help) two additional scenarios that are not network-related.
1) Reject traffic to/from the storage domain --> after a few minutes the host moves to Non-Operational. When pressing the 'Activate' button, the engine tries to activate the host and activation is initiated; the host then moves to the 'Connecting' state and after a few minutes fails with a VDSM command timeout.
2) Move a 3.5 host from a 3.5 cluster to a 3.6 cluster. When pressing the 'Activate' button, the host moves to the UP state and after a few seconds to the 'Connecting' state (the host is not compatible with the cluster's version), where it stays forever (there was no timeout; it was stuck in an endless 'Connecting' state).
All three scenarios above (including the required-network one) should prevent the engine from even trying to activate the host; there is an actual reason why the host is in the Non-Operational state.
Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has already been released and the bug is not ON_QA.
oVirt 4.0 beta has been released, moving to RC milestone.
Actually, by design you can do that.
Assume you fixed the non-operational reason offline, and now you want to go ahead and try to activate the host.
If the issues were fixed, then the host will eventually move to the UP status.
If not, it will go back to Non-Operational.
Oved, our problem is that in the described scenario the host shouldn't go UP.
If a host is missing a required network, then it shouldn't be UP.
And yet it goes UP; that is the problem here, and that is the whole point of a required network in a cluster.
The engine shouldn't allow the host to go UP when it is missing a required network.
In our case the issue wasn't fixed and the host still goes UP.
We do not agree with closing this report.
Reopening for reconsideration due to comment #6 + the impact on CI stability.
Can you explain the CI impact?
This has been that way forever, and changing the host life cycle is very problematic.
We move to UP, communicate with the host, and check the different network and storage statuses.
It has been that way for a few versions now, since 2.2.
(In reply to Oved Ourfali from comment #7)
> Can you explain the ci impact?
> This has been that way forever, and changing the host life cycle is very
> problematic.
> We move to up, communicate with the host, and check different network and
> storage status.
> This was that way for a few versions now.... Since 2.2.
Meni, could you please provide the impact on the network testing in CI?
In our Tier 1 tests we check the 'required network' behavior (the host should be Non-Operational if a required network is down), and from time to time the test fails because the host can be in the UP state.
Tier 1 runs on CI.
Anyway, this is not how it should behave, even if it has been like that since 2.2: all status checks on the host should be done in its current state (Non-Operational), and only after that should the engine decide whether the host should be UP.
Please sit with me tomorrow on the test and let's see if we can make it work somehow.
Whether it should behave this way or that is negotiable, but changing it might have implications. I'll check it again anyway. But even if it happens, it will not be anytime soon, so let's look at the test.
I looked at it again with Dan and Moti.
The issue is that the interface is there, but down.
When you activate a host, if the interface exists then the host does not move to Non-Operational.
After it moves to UP we start gathering statistics, and only then do we decide whether to move the host to Non-Operational.
If, however, the interface does not exist at all, then the host will not move to UP in the first place.
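To make the two checks described above concrete, here is a minimal, hypothetical sketch of the flow: activation only verifies that a required network's interface *exists*, while the later monitoring cycle also verifies the link state and demotes the host. The class, method names, and data model are illustrative only, not actual oVirt engine code.

```java
import java.util.Map;

public class ActivationFlowSketch {
    enum HostStatus { UP, NON_OPERATIONAL }

    // Hypothetical model: nicName -> isLinkUp; a missing key means the NIC
    // does not exist on the host at all.
    static boolean canActivate(Map<String, Boolean> nics, String requiredNic) {
        // Activation-time check: only the *existence* of the interface is
        // verified, so a host with a downed (but present) NIC activates.
        return nics.containsKey(requiredNic);
    }

    static HostStatus monitorCycle(Map<String, Boolean> nics, String requiredNic) {
        // Monitoring-time check: the link state is also verified, so a host
        // that activated with a downed NIC is demoted on the next cycle.
        Boolean linkUp = nics.get(requiredNic);
        return (linkUp != null && linkUp) ? HostStatus.UP : HostStatus.NON_OPERATIONAL;
    }

    public static void main(String[] args) {
        // NIC ens1f1 exists but its link is down, as in the reported scenario.
        Map<String, Boolean> nics = Map.of("ens1f1", false);
        System.out.println(canActivate(nics, "ens1f1"));   // activation succeeds
        System.out.println(monitorCycle(nics, "ens1f1"));  // demoted next cycle
    }
}
```

The sketch shows why the reporter sees the host flap: the two code paths consult different amounts of information about the same NIC.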
We agreed that calling the statistics gathering in the enforceNetworkCompliance method of HostNetworkTopologyPersisterImpl is risky.
Closing this as WONTFIX and moving it to Network; if you re-open, they should consider doing the above (which I highly recommend not to do...).