Bug 1413928
Summary: | If host, where hosted engine VM is running, become NonResponsive, it's not properly fenced, so HA VMs executed on that host are not restarted automatically on different host | |||
---|---|---|---|---|
Product: | [oVirt] ovirt-engine | Reporter: | RamaKasturi <knarra> | |
Component: | BLL.Infra | Assignee: | Martin Perina <mperina> | |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Petr Matyáš <pmatyas> | |
Severity: | urgent | Docs Contact: | ||
Priority: | high | |||
Version: | 4.1.0 | CC: | bugs, cshao, dfediuck, dguo, huzhao, jiawu, mgoldboi, oourfali, pstehlik, qiyuan, rnachimu, weiwang, yaniwang, ycui, yzhao | |
Target Milestone: | ovirt-4.1.0-rc | Keywords: | Regression | |
Target Release: | 4.1.0 | Flags: | rule-engine: ovirt-4.1+, rule-engine: blocker+, mgoldboi: planning_ack+, mperina: devel_ack+, pstehlik: testing_ack+ |
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | If docs needed, set a value | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1434957 | Environment: | ||
Last Closed: | 2017-02-01 14:58:18 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | 1409203, 1417196 | |||
Bug Blocks: | 1277939, 1434957 |
Description
RamaKasturi
2017-01-17 10:54:36 UTC
Event Message from Engine:
===============================
Executing power management status on Host hosted_engine1 using Proxy Host hosted_engine2 and Fence Agent ipmilan:rhsqa-grafton1-mm.lab.eng.blr.redhat.com.

Attaching logs from host hosted_engine2 (proxy host), which is rhsqa-grafton2.lab.eng.blr.redhat.com, and engine logs from the HE VM.

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1413928

1. First host goes to non-responsive.
2. Engine checks the status of the host through PM.
3. Engine thinks the host is rebooting. But I don't think it's correct.

For some reason, fencing is skipped and the engine thinks fencing is executed.

Note: the 'Host <host> is rebooting' message is logged when fencing is executed on the host.

(In reply to Oved Ourfali from bz#1390960#c8)
> This is by definition, so perhaps the gluster logic to skip fencing isn't
> working well?

I think the fencing policy is applied only during one of the following fence actions: 'on', 'off', 'reboot'. But I am not seeing any of these fence action calls in the vdsm/engine log. Only 'fence status' is being executed, and the engine says 'Host <X> is rebooting'.

If I understand the getAuditLogTypeValue() method in VdsNotRespondingTreatmentCommand correctly, this message should appear only when fencing is executed successfully.

    @Override
    public AuditLogType getAuditLogTypeValue() {
        return getSucceeded() ? AuditLogType.VDS_RECOVER : AuditLogType.VDS_RECOVER_FAILED;
    }

> Can you give the relevant time as the logs span over a lot of time?

So I've looked at the logs; here are the results:

1. ovirt-engine instance was started

2017-01-17 04:25:50,998-05 INFO [org.ovirt.engine.core.bll.ServiceLoader] (ServerService Thread Pool -- 46) [] Start org.ovirt.engine.core.bll.dwh.DwhHeartBeat@655469f4

2. hosted_engine1 changed status to Connecting - correct

2017-01-17 04:25:54,320-05 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-7) [] Host 'hosted_engine1' is not responding. It will stay in Connecting state for a grace period of 81 seconds and after that an attempt to fence the host will be issued.

3. hosted_engine1 changed status to NonResponsive and SSH Soft Fencing executed - correct

2017-01-17 04:26:06,907-05 INFO [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (org.ovirt.thread.pool-6-thread-14) [32aea450] Running command: SshSoftFencingCommand internal: true. Entities affected : ID: 58706c10-4501-4435-bf82-2181f7a0cdab Type: VDS

4. SSH Soft Fencing failed and Kdump detection started - correct

2017-01-17 04:26:08,335-05 INFO [org.ovirt.engine.core.bll.pm.VdsKdumpDetectionCommand] (org.ovirt.thread.pool-6-thread-14) [32aea450] Running command: VdsKdumpDetectionCommand internal: true. Entities affected : ID: 58706c10-4501-4435-bf82-2181f7a0cdab Type: VDS

5. Kdump flow not detected and power management restart initiated - correct

2017-01-17 04:26:38,397-05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-14) [32aea450] Correlation ID: 32aea450, Call Stack: null, Custom Event ID: -1, Message: Kdump flow is not in progress on host hosted_engine1.

6. Power management restart failed, because fencing is disabled until 5 minutes have passed since ovirt-engine startup - correct

2017-01-17 04:26:38,430-05 WARN [org.ovirt.engine.core.bll.pm.RestartVdsCommand] (org.ovirt.thread.pool-6-thread-14) [] Validation of action 'RestartVds' failed for user SYSTEM. Reasons: VAR__ACTION__RESTART,VDS_FENCE_DISABLED_AT_SYSTEM_STARTUP_INTERVAL,VDS_FENCE_OPERATION_FAILED,VAR__TYPE__HOST,VAR__ACTION__RESTART

7. We reported that the power management restart was successful - ERROR

2017-01-17 04:26:38,438-05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-14) [] Correlation ID: 32aea450, Job ID: 64634553-68bb-4b6c-a06d-ed1b97914be1, Call Stack: null, Custom Event ID: -1, Message: Host hosted_engine1 is rebooting.
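
To illustrate step 7, here is a minimal Java sketch of choosing the audit event from the result of the restart step itself, so that a restart rejected by validation is never reported as a recovery. The class, enums and method are hypothetical stand-ins made up for this example and are not the ovirt-engine API. Only the names VDS_RECOVER and VDS_RECOVER_FAILED are taken from the snippet quoted earlier in this report.

```java
// Illustrative sketch only: these types are hypothetical stand-ins,
// not the real ovirt-engine classes.
final class RecoverAuditSketch {

    /** Mirrors the two audit log types named in the quoted snippet. */
    enum AuditEvent { VDS_RECOVER, VDS_RECOVER_FAILED }

    /** Hypothetical result of the power management restart step. */
    enum RestartResult { EXECUTED_OK, EXECUTED_FAILED, VALIDATION_FAILED }

    /** Report recovery only when the restart was really executed and succeeded. */
    static AuditEvent auditEventFor(RestartResult result) {
        return result == RestartResult.EXECUTED_OK
                ? AuditEvent.VDS_RECOVER
                : AuditEvent.VDS_RECOVER_FAILED;
    }

    public static void main(String[] args) {
        // Step 6: validation of 'RestartVds' failed, so nothing was fenced.
        System.out.println(auditEventFor(RestartResult.VALIDATION_FAILED));
    }
}
```

With this shape, the case seen in the logs (validation failure at 04:26:38,430) maps to the failure event rather than to 'Host hosted_engine1 is rebooting'.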
So the host was not fenced because fencing is disabled for a 5-minute interval after engine startup (this prevents fencing storms, and the interval can be changed in engine-config using the DisableFenceAtStartupInSec option). That behaviour is correct, but we have an exception for hosted engine, as tracked by BZ1266099. Unfortunately this exception was erroneously removed in refactoring patch [1], which broke fencing in hosted engine.

[1] https://gerrit.ovirt.org/#/c/59916/

Verified on 4.1.0-9.

After running poweroff -f on the host on which all VMs were running (HE VM, HA VM and non-HA VM), the engine is restarted on the second host, and the HA VM is restarted on the second host right after that. Then the first host is fenced.

The one exception is ovirt-hosted-engine-setup, where I used 2.1.0.1-1 so that I could actually install HE, but that should not have any effect on the verification of this.
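
For reference, the startup-interval check and the hosted-engine exception discussed in the analysis above can be sketched in Java roughly as follows. This is only a sketch under stated assumptions: the class and method names are hypothetical, the 300-second value is the 5-minute interval mentioned in this report, and the condition under which the hosted-engine exception applies is reduced to a boolean parameter because the exact criterion is tracked in BZ1266099 and is not reproduced here. The timestamps are the engine start and the restart attempt from the logs quoted in this report.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.OffsetDateTime;

// Illustrative sketch only: hypothetical names, not the ovirt-engine implementation.
final class FenceStartupGuardSketch {

    /**
     * Returns true when a fence (restart) action may proceed.
     *
     * @param hostedEngineExceptionApplies assumption: true when the hosted-engine
     *        exception from BZ1266099 covers this host; the real criterion is not modelled
     */
    static boolean isFencingAllowed(Instant engineStart, Instant now,
                                    Duration disableInterval,
                                    boolean hostedEngineExceptionApplies) {
        if (hostedEngineExceptionApplies) {
            return true; // the exception bypasses the startup interval
        }
        Duration sinceStart = Duration.between(engineStart, now);
        return sinceStart.compareTo(disableInterval) >= 0;
    }

    public static void main(String[] args) {
        Instant engineStart = OffsetDateTime.parse("2017-01-17T04:25:50-05:00").toInstant();
        Instant restartAttempt = OffsetDateTime.parse("2017-01-17T04:26:38-05:00").toInstant();
        Duration interval = Duration.ofSeconds(300); // 5 minutes, per this report

        // Restart attempted 48 seconds after startup: blocked without the exception.
        System.out.println(isFencingAllowed(engineStart, restartAttempt, interval, false));
        // Same attempt with the hosted-engine exception in place: allowed.
        System.out.println(isFencingAllowed(engineStart, restartAttempt, interval, true));
    }
}
```

In the failing run the restart was attempted 48 seconds after engine startup, so without the exception the check rejects it, which matches the VDS_FENCE_DISABLED_AT_SYSTEM_STARTUP_INTERVAL reason in the log.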