Bug 1413928

Summary: If the host where the hosted engine VM is running becomes NonResponsive, it is not properly fenced, so HA VMs running on that host are not restarted automatically on a different host
Product: [oVirt] ovirt-engine
Component: BLL.Infra
Version: 4.1.0
Severity: urgent
Priority: high
Status: CLOSED CURRENTRELEASE
Reporter: RamaKasturi <knarra>
Assignee: Martin Perina <mperina>
QA Contact: Petr Matyáš <pmatyas>
CC: bugs, cshao, dfediuck, dguo, huzhao, jiawu, mgoldboi, oourfali, pstehlik, qiyuan, rnachimu, weiwang, yaniwang, ycui, yzhao
Target Milestone: ovirt-4.1.0-rc
Target Release: 4.1.0
Keywords: Regression
Flags: rule-engine: ovirt-4.1+
       rule-engine: blocker+
       mgoldboi: planning_ack+
       mperina: devel_ack+
       pstehlik: testing_ack+
Hardware: Unspecified
OS: Unspecified
oVirt Team: Infra
Type: Bug
Clones: 1434957 (view as bug list)
Bug Depends On: 1409203, 1417196
Bug Blocks: 1277939, 1434957
Last Closed: 2017-02-01 14:58:18 UTC

Description RamaKasturi 2017-01-17 10:54:36 UTC
Description of problem:

Installed an HC stack with power management enabled on all three nodes, created VMs and marked them highly available. When one of the hosts is powered off by running the command 'poweroff -f', I see that the VMs residing on that host go to unknown state and the host goes to NonResponsive. The engine checks the status of the host through PM and thinks it is rebooting, but for some reason fencing is skipped and the engine thinks fencing was executed.

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.0-0.3.beta2.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Install HC
2. Configure power management on all the hosts.
3. Create an app VM and mark it highly available.
4. Simulate a power-cable-pull scenario by running the command 'poweroff -f' on the node where the HE and app VMs are running.

Actual results:
VMs running on the node which was powered off go to unknown state and the host goes to NonResponsive. The engine for some reason does not fence the host and thinks fencing was executed.

Expected results:
VMs running on the powered-off node should migrate to another node, since they are marked highly available; and since power management is configured, the engine should fence the host.

Additional info:

Comment 1 RamaKasturi 2017-01-17 10:56:13 UTC
Event Message from Engine:
===============================

Executing power management status on Host hosted_engine1 using Proxy Host hosted_engine2 and Fence Agent ipmilan:rhsqa-grafton1-mm.lab.eng.blr.redhat.com.

Attaching logs from host hosted_engine2 (the proxy host), which is rhsqa-grafton2.lab.eng.blr.redhat.com, and engine logs from the HE VM.

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1413928

Comment 2 RamaKasturi 2017-01-17 10:56:38 UTC
1. First host goes to non-responsive.
2. Engine checks the status of host through PM.
3. Engine thinks the host is rebooting. But I don't think that's correct. For some reason, fencing is skipped and the engine thinks fencing was executed.

Note: the 'Host <host> is rebooting' message is logged when fencing is executed on the host.

Comment 3 Ramesh N 2017-01-17 11:47:01 UTC
(In reply to Oved Ourfali from bz#1390960#c8)
> This is by definition, so perhaps the gluster logic to skip fencing isn't
> working well?

I think the fencing policy is applied only during one of the following fence actions: 'on', 'off', 'reboot'. But I am not seeing any of these fence action calls in the vdsm/engine logs. Only 'fence status' is being executed, and the engine says 'Host <X> is rebooting'. If I understand the getAuditLogTypeValue() method in VdsNotRespondingTreatmentCommand correctly, this message should appear only when fencing is executed successfully.

@Override
public AuditLogType getAuditLogTypeValue() {
    // VDS_RECOVER is the success event; VDS_RECOVER_FAILED the failure one.
    return getSucceeded() ? AuditLogType.VDS_RECOVER : AuditLogType.VDS_RECOVER_FAILED;
}
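
For readers unfamiliar with the engine's audit plumbing: the reasoning above relies on VDS_RECOVER resolving to the 'Host <X> is rebooting' event text. A minimal sketch of that mapping, inferred only from the events quoted in this bug (the lookup class and exact template strings here are hypothetical, not the real engine resources):

// Hypothetical stand-in for the engine's audit message lookup; the template
// text is inferred from the events quoted in comments 2 and 4.
import java.util.Map;

class AuditMessageSketch {
    enum AuditLogType { VDS_RECOVER, VDS_RECOVER_FAILED }

    static final Map<AuditLogType, String> TEMPLATES = Map.of(
            AuditLogType.VDS_RECOVER, "Host ${VdsName} is rebooting.",
            AuditLogType.VDS_RECOVER_FAILED, "Failed to fence host ${VdsName}.");

    static String render(AuditLogType type, String vdsName) {
        return TEMPLATES.get(type).replace("${VdsName}", vdsName);
    }
}

So if 'Host <X> is rebooting' was logged, getSucceeded() must have returned true even though no fence action actually ran.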


> Can you give the relevant time as the logs span over a lot of time?

Comment 4 Martin Perina 2017-01-18 15:08:15 UTC
So I've looked at the logs; here are the results (a sketch of the overall flow follows the list):

1. ovirt-engine instance was started
2017-01-17 04:25:50,998-05 INFO  [org.ovirt.engine.core.bll.ServiceLoader] (ServerService Thread Pool -- 46) [] Start org.ovirt.engine.core.bll.dwh.DwhHeartBeat@655469f4

2. hosted_engine1 changed status to Connecting - correct
2017-01-17 04:25:54,320-05 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-7) [] Host 'hosted_engine1' is not responding. It will stay in Connecting state for a grace period of 81 seconds and after that an attempt to fence the host will be issued.

3. hosted_engine1 changed status to NonResponsive and SSH Soft Fencing executed - correct
2017-01-17 04:26:06,907-05 INFO  [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (org.ovirt.thread.pool-6-thread-14) [32aea450] Running command: SshSoftFencingCommand internal: true. Entities affected :  ID: 58706c10-4501-4435-bf82-2181f7a0cdab Type: VDS

4. SSH Soft Fencing failed and Kdump detection started - correct
2017-01-17 04:26:08,335-05 INFO  [org.ovirt.engine.core.bll.pm.VdsKdumpDetectionCommand] (org.ovirt.thread.pool-6-thread-14) [32aea450] Running command: VdsKdumpDetectionCommand internal: true. Entities affected :  ID: 58706c10-4501-4435-bf82-2181f7a0cdab Type: VDS


5. Kdump flow not detected and power management restart initiated - correct
2017-01-17 04:26:38,397-05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-14) [32aea450] Correlation ID: 32aea450, Call Stack: null, Custom Event ID: -1, Message: Kdump flow is not in progress on host hosted_engine1. 

6. Power management restart failed, because fencing is disabled until 5 minutes have passed since ovirt-engine startup - correct
2017-01-17 04:26:38,430-05 WARN  [org.ovirt.engine.core.bll.pm.RestartVdsCommand] (org.ovirt.thread.pool-6-thread-14) [] Validation of action 'RestartVds' failed for user SYSTEM. Reasons: VAR__ACTION__RESTART,VDS_FENCE_DISABLED_AT_SYSTEM_STARTUP_INTERVAL,VDS_FENCE_OPERATION_FAILED,VAR__TYPE__HOST,VAR__ACTION__RESTART

7. We issued an event saying the power management restart was successful - ERROR
2017-01-17 04:26:38,438-05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-14) [] Correlation ID: 32aea450, Job ID: 64634553-68bb-4b6c-a06d-ed1b97914be1, Call Stack: null, Custom Event ID: -1, Message: Host hosted_engine1 is rebooting.
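
Put together, steps 2-7 above correspond to the following treatment flow. This is a minimal sketch reconstructed from the log lines only; the class and method names below are hypothetical, not the real engine code:

// Sketch of the non-responding host treatment flow reconstructed from the
// log lines above; names are hypothetical, only the order of operations
// comes from the logs.
class NonRespondingTreatmentSketch {

    enum AuditLogType { VDS_RECOVER, VDS_RECOVER_FAILED }

    void treatNonRespondingHost(String hostName) {
        if (sshSoftFence(hostName)) {      // step 3: try SSH soft fencing first
            return;                        // host recovered, no hard fence needed
        }
        if (kdumpInProgress(hostName)) {   // step 4: never fence a host that is
            return;                        // dumping its kernel crash state
        }
        // Steps 5-7: fall back to a power management restart. In this bug the
        // restart failed validation (startup quiet period), yet the success
        // event "Host <X> is rebooting" was emitted anyway.
        boolean restarted = restartViaPowerManagement(hostName);
        audit(restarted ? AuditLogType.VDS_RECOVER : AuditLogType.VDS_RECOVER_FAILED,
                hostName);
    }

    // Stubs standing in for the real engine commands.
    boolean sshSoftFence(String host) { return false; }
    boolean kdumpInProgress(String host) { return false; }
    boolean restartViaPowerManagement(String host) { return false; }
    void audit(AuditLogType type, String host) { /* emit the audit event */ }
}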


So the host was not fenced because fencing is disabled for a 5 minute interval after engine startup (this prevents fencing storms; the interval can be changed in engine-config using the DisableFenceAtStartupInSec option). That behavior is correct in general, but we have an exception for hosted engine, as tracked by BZ 1266099. Unfortunately this exception was erroneously removed in refactoring patch [1], which broke fencing in hosted engine environments (see the sketch after [1] below).


[1] https://gerrit.ovirt.org/#/c/59916/
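
To make the fix concrete, here is a minimal sketch of the startup quiet-period check with the hosted engine exception restored. The class, method and field names are hypothetical; only DisableFenceAtStartupInSec is a real engine-config option:

// Sketch of the intended validation; names are hypothetical.
class FenceAtStartupGuard {

    private final long engineStartMillis;
    private final long quietPeriodMillis; // DisableFenceAtStartupInSec * 1000

    FenceAtStartupGuard(long engineStartMillis, long disableFenceAtStartupSec) {
        this.engineStartMillis = engineStartMillis;
        this.quietPeriodMillis = disableFenceAtStartupSec * 1000L;
    }

    // Fencing is normally skipped for a while after engine startup to avoid
    // fencing storms when the engine comes up before its hosts do. The host
    // running the hosted engine VM must be exempt: if it stays down, neither
    // the HA VMs nor the engine itself can be recovered. This exemption is
    // what the refactoring patch [1] accidentally dropped.
    boolean isFencingAllowed(boolean hostRunsHostedEngineVm, long nowMillis) {
        boolean inQuietPeriod = nowMillis - engineStartMillis < quietPeriodMillis;
        return hostRunsHostedEngineVm || !inQuietPeriod;
    }
}

The interval itself can be read or changed with engine-config, e.g. 'engine-config -g DisableFenceAtStartupInSec' to read it and 'engine-config -s DisableFenceAtStartupInSec=300' to set it (the 5 minute interval mentioned above).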

Comment 6 Petr Matyáš 2017-01-27 15:44:04 UTC
Verified on 4.1.0-9

After running 'poweroff -f' on the host on which all VMs were running (the HE VM, an HA VM and a non-HA VM), the engine is restarted on the second host, and the HA VM is restarted on the second host right after that. Then the first host is fenced.

The one exception is ovirt-hosted-engine-setup, where I used 2.1.0.1-1 so I could actually install HE, but that should not have any effect on verifying this.