Description of problem:
Three nodes in the cluster: rhs1, rhs2 and rhs3. Apply the new gluster fencing policies "Skip fencing if bricks are up" and "Skip fencing if Gluster Quorum not met". One of the nodes (rhs3) is moved to maintenance with the glusterd service stopped on it, and another node (rhs1) is non-responsive because its management network (ovirtmgmt) is down. The non-responsive host is not fenced due to the gluster fencing policies, as expected, but the VMs on that node go to Unknown state until the host becomes responsive again.

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.0-0.0.master.20161024211322.gitfc0de31.el7.centos.noarch
vdsm-4.18.999-779.giteb80305.el7.centos.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install an HC cluster.
2. Configure power management.
3. Apply the new gluster fencing policies.
4. Create a new VM and mark it as highly available.
5. Out of the three nodes, move one host into maintenance by stopping glusterd on that node.
6. Bring down the ovirtmgmt NIC on one of the other hosts and wait for that host to move to the non-responsive state.

Actual results:
The status of the VMs residing on that host goes to 'Unknown' until the host becomes responsive.

Expected results:
VMs residing on that host should not go to 'Unknown' state.

Additional info:
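The two failure injections in the steps above can be sketched from a shell on the hypervisors. This is a minimal sketch, assuming root shell access to the nodes and that the management bridge is named ovirtmgmt as in the report; the hostnames are the ones from the description.

```shell
# On the node to be moved to maintenance (rhs3 in the report):
# stop glusterd so the gluster quorum/brick fencing policies come into play.
systemctl stop glusterd

# On a second node (rhs1 in the report): take the management bridge down
# so the engine loses connectivity and marks the host non-responsive.
ip link set ovirtmgmt down

# To recover after observing the VM state, bring the bridge back up
# and restart glusterd on the maintenance node:
#   ip link set ovirtmgmt up        (on rhs1)
#   systemctl start glusterd        (on rhs3)
```

These commands only inject the failure; the fencing-policy behavior itself is observed in the engine UI and logs.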
sos reports can be found in the link below: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1390960/
That is intentional. An unresponsive host is not responding, so the status of its VMs cannot be known. Why is this a problem?
Hi Michal, AFAIU, when a VM is marked highly available and the host on which it resides goes down, the VM should be automatically restarted on another host. Please correct me if I am wrong. Thanks, kasturi.
I have re-tested this with the latest 4.1 bits by running the steps below.
1. Install an HC cluster.
2. Configure power management.
3. Apply the new gluster fencing policies.
4. Create a new VM and mark it as highly available.
5. Bring down the ovirtmgmt NIC on one of the hosts and wait for the host to move to the non-responsive state.
Actual results: the status of VMs residing on that host goes to 'Unknown' until the host becomes responsive.
Expected results: VMs residing on that host should not go to 'Unknown' state, and I/O should stop happening on the VM.
1. The first host goes non-responsive. 2. The engine checks the status of the host through PM. 3. The engine thinks the host is rebooting. But I don't think that's correct. For some reason, fencing is skipped, yet the engine thinks fencing was executed. Note: the 'Host <host> is rebooting' message is logged when fencing is executed on the host.
This is by definition, so perhaps the gluster logic to skip fencing isn't working well? Can you give the relevant time frame, as the logs span a long period?
This should be between 12.00 P.M - 2.00 P.M IST.
Please ignore comments 5, 6, 7 and 9. They were supposed to go to another bug; my bad that I updated them here.
(In reply to RamaKasturi from comment #9) > This should be between 12.00 P.M - 2.00 P.M IST. Hi Oved, I ran this test case a long time back and am not exactly sure of the time. But based on the logged bug, I think it should be between 3.30 P.M and 4.30 P.M. Thanks, kasturi
Can you retest this with the fix for Bug 1413928?
(In reply to Sahina Bose from comment #12) > Can you retest this with fix for Bug 1413928 Any news?
Sas, can you check this?
Closing. If the issue occurs again, please reopen.
(In reply to Sahina Bose from comment #14) > Sas, can you check this? I will retest this scenario with RHV 4.3 and RHGS 3.4.4.