Bug 1390960 - When one of the nodes goes to non-responsive state, the status of VMs residing on that host goes to UNKNOWN
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Infra
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Martin Perina
QA Contact: Pavel Stehlik
URL:
Whiteboard:
Depends On:
Blocks: Gluster-HC-2
Reported: 2016-11-02 09:57 UTC by RamaKasturi
Modified: 2019-04-27 02:37 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-02-03 07:28:51 UTC
oVirt Team: Infra
Embargoed:
sabose: ovirt-4.1?
rule-engine: planning_ack?
rule-engine: devel_ack?
rule-engine: testing_ack?


Description RamaKasturi 2016-11-02 09:57:27 UTC
Description of problem:
There are three nodes in the cluster: rhs1, rhs2 and rhs3. Apply the new gluster fencing policies "Skip fencing if bricks are up" and "Skip fencing if Gluster Quorum not met". One of the nodes (hs3) is moved to maintenance with the glusterd service stopped on that node, and another node (hs1) is non-responsive because its network (ovirtmgmt) is down. The non-responsive host does not get fenced, due to the gluster fencing policies, but the VMs on that node go to Unknown state until the host becomes responsive again.
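
The behaviour described above can be sketched as a small decision model. This is only an illustrative sketch and not ovirt-engine code: the names FencingPolicy, should_fence and reported_vm_status are hypothetical, and the reading of the two policies (fencing is skipped when the host's bricks are still up, or when fencing the host would leave the gluster quorum unmet) is an assumption based on the policy names.

```python
from dataclasses import dataclass


@dataclass
class FencingPolicy:
    # Hypothetical flags mirroring the two policy names in this report.
    skip_if_bricks_up: bool = True
    skip_if_quorum_not_met: bool = True


def should_fence(policy: FencingPolicy,
                 bricks_up_on_host: bool,
                 quorum_lost_if_fenced: bool) -> bool:
    """Return True only if no enabled policy asks to skip fencing (assumed semantics)."""
    if policy.skip_if_bricks_up and bricks_up_on_host:
        return False
    if policy.skip_if_quorum_not_met and quorum_lost_if_fenced:
        return False
    return True


def reported_vm_status(host_responsive: bool, last_known_status: str) -> str:
    """A VM on a host that cannot be reached is reported as Unknown."""
    return last_known_status if host_responsive else "Unknown"


# Scenario from this report: one node is already in maintenance, so fencing
# the non-responsive node is assumed to break quorum; its bricks are assumed up.
policy = FencingPolicy()
fence = should_fence(policy, bricks_up_on_host=True, quorum_lost_if_fenced=True)
status = reported_vm_status(host_responsive=False, last_known_status="Up")
print(fence, status)
```

Under these assumptions the host is not fenced and its VMs stay in Unknown state for as long as the host is unreachable, which matches what is observed here.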

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.0-0.0.master.20161024211322.gitfc0de31.el7.centos.noarch
vdsm-4.18.999-779.giteb80305.el7.centos.x86_64


How reproducible:
Always

Steps to Reproduce:
1. Install HC cluster.
2. Configure power management.
3. Apply the new gluster fencing policies.
4. Create a new VM and mark it as highly available.
5. Out of the three nodes, move one host into maintenance, stopping glusterd on that node.
6. Bring down the ovirtmgmt NIC on one of the other hosts and wait for that host to move to non-responsive state.

Actual results:
Status of VMs residing on that host goes to 'UNKNOWN' until the host becomes responsive.

Expected results:
VMs residing on that host should not go to UNKNOWN state.

Additional info:

Comment 1 RamaKasturi 2016-11-02 10:28:10 UTC
The sos reports can be found at the link below:

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1390960/

Comment 2 Michal Skrivanek 2016-11-03 05:11:08 UTC
That is intentional. An unresponsive host is not responding, hence the status of its VMs cannot be known. Why is it a problem?

Comment 3 RamaKasturi 2016-11-03 10:08:20 UTC
Hi Michal,

Comment 4 RamaKasturi 2016-11-03 10:09:31 UTC
Hi Michal,
 
    AFAIU, when a VM is marked highly available and the host on which it resides goes down, the VM should be automatically restarted on another host. Please correct me if I am wrong.

Thanks
kasturi.

Comment 5 RamaKasturi 2017-01-16 07:40:09 UTC
I have re-tested this with the latest 4.1 bits by running the steps below.

1. Install HC cluster.
2. Configure power management.
3. Apply the new gluster fencing policies.
4. Create a new VM and mark it as highly available.
5. Bring down the ovirtmgmt NIC on one of the hosts and wait for the host to move to non-responsive state.

Actual results:
Status of VMs residing on that host goes to 'UNKNOWN' until the host becomes responsive.

Expected results:
VMs residing on that host should not go to UNKNOWN state and I/O stops happening on the VM.

Comment 7 RamaKasturi 2017-01-17 10:04:54 UTC
1. First, the host goes to non-responsive.
2. Engine checks the status of the host through PM.
3. Engine thinks the host is rebooting, but I don't think that is correct. For some reason, fencing is skipped and the engine thinks fencing was executed.

Note: 'Host <host> is rebooting' message is logged when fencing is executed on the host.

Comment 8 Oved Ourfali 2017-01-17 10:47:04 UTC
This is by definition, so perhaps the gluster logic to skip fencing isn't working well?
Can you give the relevant time, as the logs span a long period?

Comment 9 RamaKasturi 2017-01-17 11:06:07 UTC
This should be between 12.00 P.M - 2.00 P.M IST.

Comment 10 RamaKasturi 2017-01-17 11:38:38 UTC
Please ignore comments 5, 6, 7 and 9. They were supposed to go to another bug; it was my mistake that I posted them here.

Comment 11 RamaKasturi 2017-01-17 11:42:14 UTC
(In reply to RamaKasturi from comment #9)
> This should be between 12.00 P.M - 2.00 P.M IST.

Hi Oved,
  
    I ran this test case a long time ago and am not sure exactly what the time would be. But based on when the bug was logged, I think it should be between 3.30 P.M - 4.30 P.M.

Thanks
kasturi

Comment 12 Sahina Bose 2017-01-23 10:00:41 UTC
Can you retest this with the fix for Bug 1413928?

Comment 13 Martin Perina 2017-01-30 08:03:08 UTC
(In reply to Sahina Bose from comment #12)
> Can you retest this with fix for Bug 1413928

Any news?

Comment 14 Sahina Bose 2017-01-30 08:59:00 UTC
Sas, can you check this?

Comment 15 Oved Ourfali 2017-02-03 07:28:51 UTC
Closing. 
If the issue occurs again, please reopen.

Comment 16 SATHEESARAN 2019-04-27 02:33:30 UTC
(In reply to Sahina Bose from comment #14)
> Sas, can you check this?

I will retest this scenario with RHV 4.3 and RHGS 3.4.4.

