Bug 1099134

Summary: HA vms restart failed after power outage
Product: [Retired] oVirt Reporter: Yuriy Demchenko <demchenko.ya>
Component: ovirt-engine-core Assignee: Eldan Shachar <eshachar>
Status: CLOSED CURRENTRELEASE QA Contact: Pavel Stehlik <pstehlik>
Severity: low Docs Contact:
Priority: unspecified    
Version: 3.3 CC: bugs, demchenko.ya, dfediuck, gklein, michal.skrivanek, ofrenkel, rbalakri, yeylon
Target Milestone: ---   
Target Release: 3.6.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: virt
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-07-30 12:57:39 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Virt RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description      Flags
engine log       none
srv12 vdsm log   none
srv19 vdsm log   none

Description Yuriy Demchenko 2014-05-19 14:21:33 UTC
Created attachment 897181 [details]
engine log

Description of problem:
After a power outage, the restart of HA VMs failed with the error 'VdcBLLException: java.lang.NullPointerException (Failed with error ENGINE and code 5001)'

Version-Release number of selected component (if applicable):
CentOS 6.5 on engine & hosts
FC SAN storage
ovirt-engine-3.3.2-1.el6.noarch
vdsm-4.13.2-1.el6.x86_64

How reproducible:
Don't know yet

Actual results:
HA VMs in state 'down', no additional restart attempts

Expected results:
HA VMs successfully restarted

Additional info:
I had a double power outage. The first time, power came back after 00:19; at 00:23 the engine booted up and there were HA VM restart attempts that failed with the error above. The second outage lasted from 00:26 to 10:09, and after the engine booted for the second time only 2 VMs out of 46 were restarted, although all of them were configured as HA.

For example, the timeline for VM 'prod.gui':
Host srv19 was already up when the engine booted.
2014-05-17 00:23:10,552 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-14) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv19.
2014-05-17 00:23:10,565 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-14) vm prod.gui running in db and not running in vds - add to rerun treatment. vds srv19
2014-05-17 00:23:10,909 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-14) [2989840c] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Highly Available VM prod.gui failed. It will be restarted automatically.
2014-05-17 00:23:10,911 INFO  [org.ovirt.engine.core.bll.VdsEventListener] (DefaultQuartzScheduler_Worker-14) [2989840c] Highly Available VM went down. Attempting to restart. VM Name: prod.gui, VM Id:bbb7a605-d511-461d-99d2-c5a5bf8d9958
2014-05-17 00:23:11,569 WARN  [org.ovirt.engine.core.bll.RunVmCommand] (DefaultQuartzScheduler_Worker-14) [2607ded5] Failed to run VM prod.gui: VdcBLLException: java.lang.NullPointerException (Failed with error ENGINE and code 5001)
2014-05-17 00:23:11,569 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (DefaultQuartzScheduler_Worker-14) [2607ded5] Lock freed to object EngineLock [exclusiveLocks= key: bbb7a605-d511-461d-99d2-c5a5bf8d9958 value: VM
, sharedLocks= ]
2014-05-17 00:23:11,570 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (DefaultQuartzScheduler_Worker-14) [2607ded5] Trying to rerun VM prod.gui
2014-05-17 00:23:11,577 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-14) [2607ded5] Correlation ID: 2607ded5, Job ID: d10c372d-5a8d-455b-bf66-3c8d826520af, Call Stack: null, Custom Event ID: -1, Message: Failed to restart VM prod.gui on Host srv19
2014-05-17 00:25:31,312 INFO  [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (DefaultQuartzScheduler_Worker-14) [2607ded5] START, CreateVmVDSCommand(HostName = srv12, HostId = eac8d084-1be0-4e8d-a589-385f71fe3afa, vmId=bbb7a605-d511-461d-99d2-c5a5bf8d9958, vm=VM [prod.gui]), log id: 7dff3269

After that came the second power outage. When power went back there were no further attempts to restart that VM; it stayed in state 'down' until a manual start.
In srv19's vdsm.log the corresponding task failed with the error 'Task is aborted: 'Unknown pool id, pool not connected' - code 309'.

Comment 1 Yuriy Demchenko 2014-05-19 14:23:19 UTC
Created attachment 897183 [details]
srv12 vdsm log

srv12 vdsm log - presumably the SPM after the first outage

Comment 2 Yuriy Demchenko 2014-05-19 14:23:50 UTC
Created attachment 897184 [details]
srv19 vdsm log

Comment 3 Omer Frenkel 2014-05-20 08:52:18 UTC
Unfortunately it seems there is a bug in the logging mechanism that doesn't print the exception stack trace, so I can't get the root cause of why getting the IsoPrefix is failing.
But according to this error from the bug description:
 'Task is aborted: 'Unknown pool id, pool not connected' - code 309'
it seems that the host wasn't really connected to the storage pool, which probably caused the failure.

Maybe while the engine was down the host lost its connection, but the status remained Up because that was the host's status before the engine was restarted.

Basically, in 3.4 there is a new mechanism that retries running failed HA VMs; this mechanism would probably help to start the VM on another host, if one was available.
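
For context, a minimal sketch of what such a retry loop could look like (the class, interface and one-minute interval below are hypothetical illustrations, not the actual oVirt 3.4 code):

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: failed HA VMs are queued and periodically retried
// until a run attempt succeeds (possibly on a different host).
public class HaVmRetrySketch {

    interface VmRunner {
        // Returns true if the VM was successfully started.
        boolean tryRun(String vmId);
    }

    private final Queue<String> failedHaVms = new ConcurrentLinkedQueue<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final VmRunner runner;

    HaVmRetrySketch(VmRunner runner) {
        this.runner = runner;
        // Re-attempt all queued HA VMs once a minute.
        scheduler.scheduleWithFixedDelay(this::retryAll, 1, 1, TimeUnit.MINUTES);
    }

    // Called when a run attempt for an HA VM fails (e.g. the code 5001 error above).
    void onHaVmRunFailed(String vmId) {
        failedHaVms.add(vmId);
    }

    private void retryAll() {
        // Drain the queue once; anything that still fails goes back for the next cycle.
        for (int i = failedHaVms.size(); i > 0; i--) {
            String vmId = failedHaVms.poll();
            if (vmId != null && !runner.tryRun(vmId)) {
                failedHaVms.add(vmId);
            }
        }
    }
}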

Comment 4 Gilad Chaplik 2014-05-20 12:02:05 UTC
Yuri, 

I thought the version is 3.4...

Gilad.

Comment 5 Yuriy Demchenko 2014-05-21 06:45:09 UTC
(In reply to Omer Frenkel from comment #3)
> Unfortunately it seems there is a bug in the logging mechanism that doesn't
> print the exception stack trace, so I can't get the root cause of why
> getting the IsoPrefix is failing.
> But according to this error from the bug description:
>  'Task is aborted: 'Unknown pool id, pool not connected' - code 309'
> it seems that the host wasn't really connected to the storage pool, which
> probably caused the failure.
> 
> Maybe while the engine was down the host lost its connection, but the
> status remained Up because that was the host's status before the engine
> was restarted.
> 
Well, according to the storage and switch logs they were up and available some 2-4 minutes before the engine booted; however, I can't verify that for sure.

> Basically, in 3.4 there is a new mechanism that retries running failed HA
> VMs; this mechanism would probably help to start the VM on another host,
> if one was available.
OK, if there's nothing else I can dig into to clarify the reason for the failure, then I guess I'll just upgrade to 3.4 after finishing the necessary bureaucratic procedures.

(In reply to Gilad Chaplik from comment #4)
> I thought the version is 3.4...
Sorry, I didn't get that?

Comment 6 Michal Skrivanek 2014-05-21 06:51:51 UTC
The reported version is 3.3, so Omer's explanation applies (by the way, did we fix the stack trace?).

Comment 7 Gilad Chaplik 2014-05-21 08:07:04 UTC
(In reply to Gilad Chaplik from comment #4)
> Yuri, 
> 
> I thought the version is 3.4...
> 
> Gilad.

I thought it was another Yuriy, sorry about that :)

Comment 8 Michal Skrivanek 2014-06-02 11:07:48 UTC
Keeping it open for the stack trace fix. It is not related to the reported issue, actually.

Comment 10 Omer Frenkel 2015-07-30 12:57:39 UTC
Stack trace issue fixed upstream: the stack trace is logged at debug level, while the exception message is logged at error level.
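
For illustration, a minimal sketch of that split using SLF4J (not the actual upstream patch; the class and method names here are made up):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch: the short exception message goes to the error log, while the full
// stack trace is only emitted at debug level.
public class RunVmFailureLogging {

    private static final Logger log = LoggerFactory.getLogger(RunVmFailureLogging.class);

    static void logRunVmFailure(String vmName, Exception e) {
        // Error log stays readable: message only, no stack trace.
        log.error("Failed to run VM {}: {}", vmName, e.getMessage());
        // Full stack trace is available when debug logging is enabled.
        log.debug("Failed to run VM " + vmName, e);
    }
}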