Bug 1390960

Summary: When one of the nodes goes into a non-responsive state, the status of VMs residing on that host goes to UNKNOWN
Product: [oVirt] ovirt-engine
Reporter: RamaKasturi <knarra>
Component: BLL.Infra
Assignee: Martin Perina <mperina>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Pavel Stehlik <pstehlik>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 4.1.0
CC: bugs, knarra, michal.skrivanek, mperina, oourfali, sabose, sasundar
Target Milestone: ---
Flags: sabose: ovirt-4.1?
rule-engine: planning_ack?
rule-engine: devel_ack?
rule-engine: testing_ack?
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-02-03 07:28:51 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1277939    

Description RamaKasturi 2016-11-02 09:57:27 UTC
Description of problem:
The cluster has three nodes: rhs1, rhs2 and rhs3. The new gluster fencing policies "Skip fencing if bricks are up" and "Skip fencing if gluster quorum not met" are applied. One node (rhs3) is moved to maintenance with the glusterd service stopped on it, and another node (rhs1) becomes non-responsive because its management network (ovirtmgmt) is down. The non-responsive host does not get fenced due to the gluster fencing policies, but the VMs on that node go to the Unknown state until the host becomes responsive again.
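
For illustration only, here is a minimal Python sketch of how these two policies gate fencing in the scenario above. This is not the ovirt-engine implementation (which evaluates brick and peer status reported by vdsm); the data layout and field names below are hypothetical.

# Illustrative sketch only -- not ovirt-engine code. Cluster state is modeled
# as plain dicts and the field names are hypothetical.

def should_skip_fencing(host, cluster):
    """Return True if the gluster fencing policies say fencing must be skipped."""
    # "Skip fencing if bricks are up": do not fence a host that is still
    # serving gluster bricks.
    if any(cluster["bricks_up"].get(host, [])):
        return True
    # "Skip fencing if gluster quorum not met": do not fence if removing this
    # host would leave fewer than a majority of peers with glusterd running.
    peers = [h for h in cluster["hosts"] if h != host]
    peers_up = sum(1 for h in peers if cluster["glusterd_up"][h])
    return peers_up < len(cluster["hosts"]) // 2 + 1

# Scenario from this bug: rhs3 is in maintenance with glusterd stopped,
# rhs1 is non-responsive (ovirtmgmt down) but its bricks are still up.
cluster = {
    "hosts": ["rhs1", "rhs2", "rhs3"],
    "glusterd_up": {"rhs1": True, "rhs2": True, "rhs3": False},
    "bricks_up": {"rhs1": [True, True], "rhs2": [True, True], "rhs3": []},
}
print(should_skip_fencing("rhs1", cluster))  # True -> fencing is skipped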

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.0-0.0.master.20161024211322.gitfc0de31.el7.centos.noarch
vdsm-4.18.999-779.giteb80305.el7.centos.x86_64


How reproducible:
Always

Steps to Reproduce:
1. Install HC cluster
2. Configure power management.
3. Apply the new gluster fencing policies.
4. Create a new VM and mark it as highly available.
5. Move one of the three hosts into maintenance by stopping glusterd on that node.
6. Bring down the ovirtmgmt nic on another host and wait for that host to move to a non-responsive state (see the sketch below).
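
A rough sketch of steps 5 and 6, assuming passwordless root ssh access to the nodes and that ovirtmgmt is the management bridge on each host; the host names come from the description and the run_on helper is hypothetical.

# Rough reproduction helper -- assumes passwordless root ssh to the nodes.
import subprocess

def run_on(host, command):
    # Run a shell command on a node over ssh.
    subprocess.run(["ssh", f"root@{host}", command], check=True)

# Step 5: stop glusterd on the node being moved to maintenance.
run_on("rhs3", "systemctl stop glusterd")

# Step 6: take the management network down on another node so the engine
# eventually marks it non-responsive.
run_on("rhs1", "ip link set ovirtmgmt down")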

Actual results:
The status of VMs residing on that host goes to 'UNKNOWN' until the host becomes responsive.

Expected results:
VMs residing on that host should not go to the UNKNOWN state.

Additional info:

Comment 1 RamaKasturi 2016-11-02 10:28:10 UTC
sos reports can be found in the link below:

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1390960/

Comment 2 Michal Skrivanek 2016-11-03 05:11:08 UTC
That is intentional. An unresponsive host is not responding, hence the status of its VMs cannot be known. Why is it a problem?

Comment 3 RamaKasturi 2016-11-03 10:08:20 UTC
Hi Michal,

Comment 4 RamaKasturi 2016-11-03 10:09:31 UTC
Hi Michal,
 
    AFAIU, when a VM is marked highly available and the host on which it resides goes down, the VM should be automatically restarted on another host. Please correct me if I am wrong.

Thanks
kasturi.

Comment 5 RamaKasturi 2017-01-16 07:40:09 UTC
I have re-tested this with the latest 4.1 bits by running the steps below.

1. Install HC cluster
2. Configure power management.
3. Apply the new gluster fencing policies.
4. Create a new VM and mark it as highly available.
5. Bring down the ovirtmgmt nic on one of the hosts and wait for the host to move to a non-responsive state.

Actual results:
The status of VMs residing on that host goes to 'UNKNOWN' until the host becomes responsive.

Expected results:
VMs residing on that host should not go to the UNKNOWN state, and I/O stops happening on the VM.

Comment 7 RamaKasturi 2017-01-17 10:04:54 UTC
1. First, the host goes non-responsive.
2. The engine checks the status of the host through power management (PM).
3. The engine thinks the host is rebooting, but I don't think that is correct. For some reason, fencing is skipped and the engine thinks fencing was executed.

Note: The 'Host <host> is rebooting' message is logged when fencing is executed on the host (see the sketch below).
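
To make the reported flow concrete, here is a minimal Python sketch of the behaviour described in comments 2 and 7. It is illustrative pseudologic only, not the engine's actual non-responsive-host handling; the state names simply mirror what the UI shows.

# Illustrative pseudologic only -- not ovirt-engine code.

def on_host_non_responsive(vms_on_host, fencing_skipped):
    # The engine can no longer query the host, so the VMs' status is unknowable.
    for vm in vms_on_host:
        vm["status"] = "Unknown"
    if fencing_skipped:
        # Per comment 7: the gluster policies skip fencing, yet the engine
        # logs 'Host <host> is rebooting' as if fencing had been executed.
        # Without a confirmed fence the VMs stay Unknown and HA restart
        # does not happen until the host recovers.
        return "fencing skipped; VMs stay Unknown"
    # With a confirmed fence the engine knows the VMs are down, so it can
    # restart the highly available ones on another host.
    for vm in vms_on_host:
        vm["status"] = "Down"
        if vm["highly_available"]:
            vm["restart_on_another_host"] = True
    return "host fenced; HA VMs scheduled for restart"

vms = [{"name": "vm1", "highly_available": True, "status": "Up"}]
print(on_host_non_responsive(vms, fencing_skipped=True))  # the reported case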

Comment 8 Oved Ourfali 2017-01-17 10:47:04 UTC
This is by design, so perhaps the gluster logic to skip fencing isn't working well?
Can you give the relevant time frame, as the logs span a long period?

Comment 9 RamaKasturi 2017-01-17 11:06:07 UTC
This should be between 12.00 P.M - 2.00 P.M IST.

Comment 10 RamaKasturi 2017-01-17 11:38:38 UTC
Please ignore comments 5, 6, 7 and 9. They were supposed to go to another bug; my bad for updating this one.

Comment 11 RamaKasturi 2017-01-17 11:42:14 UTC
(In reply to RamaKasturi from comment #9)
> This should be between 12.00 P.M - 2.00 P.M IST.

Hi Oved,
  
    I ran this test case a long time ago and I am not sure exactly what the time would be, but based on when the bug was logged I think it should be between 3:30 P.M. and 4:30 P.M.

Thanks
kasturi

Comment 12 Sahina Bose 2017-01-23 10:00:41 UTC
Can you retest this with the fix for Bug 1413928?

Comment 13 Martin Perina 2017-01-30 08:03:08 UTC
(In reply to Sahina Bose from comment #12)
> Can you retest this with the fix for Bug 1413928?

Any news?

Comment 14 Sahina Bose 2017-01-30 08:59:00 UTC
Sas, can you check this?

Comment 15 Oved Ourfali 2017-02-03 07:28:51 UTC
Closing. 
If the issue occurs again, please reopen.

Comment 16 SATHEESARAN 2019-04-27 02:33:30 UTC
(In reply to Sahina Bose from comment #14)
> Sas, can you check this?

I will retest this scenario with RHV 4.3 and RHGS 3.4.4