Bug 1099134
Summary: HA VMs restart failed after power outage

| Field | Value | Field | Value |
| --- | --- | --- | --- |
| Product: | [Retired] oVirt | Reporter: | Yuriy Demchenko <demchenko.ya> |
| Component: | ovirt-engine-core | Assignee: | Eldan Shachar <eshachar> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Pavel Stehlik <pstehlik> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.3 | CC: | bugs, demchenko.ya, dfediuck, gklein, michal.skrivanek, ofrenkel, rbalakri, yeylon |
| Target Milestone: | --- | | |
| Target Release: | 3.6.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | virt | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-07-30 12:57:39 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Attachments:

Created attachment 897183 [details]
srv12 vdsm log (supposedly the SPM after the first outage)

Created attachment 897184 [details]
srv19 vdsm log
---

Comment #3 (Omer Frenkel):

Unfortunately it seems there is a bug in the logging mechanism that doesn't print the exception stack trace, so I can't get the root cause of why getting the IsoPrefix is failing. But according to this error from the bug description: 'Task is aborted: 'Unknown pool id, pool not connected' - code 309', it seems the host wasn't really connected to the storage pool, which probably caused the failure.

Maybe during the time the engine was down the host lost its connection, but its status remained Up because that was its status before the engine was restarted.

Basically, 3.4 has a new mechanism that retries failed HA VMs; this mechanism would probably help to start the VM on another host, if there was one.

---

Comment #4 (Gilad Chaplik):

Yuri,

I thought the version is 3.4...

Gilad.

---

(In reply to Omer Frenkel from comment #3)
> maybe during the time the engine was down the host lost its connection,
> but its status remained Up because that was its status before the engine
> was restarted

Well, according to the storage and switch logs, they were up and available some 2-4 minutes before the engine booted; however, I can't verify that for sure.

> basically, 3.4 has a new mechanism that retries failed HA VMs; this
> mechanism would probably help to start the VM on another host, if there
> was one

OK, if there's nothing else I can dig into to clarify the reasons for the failure, then I guess I'll just upgrade to 3.4 after finishing the necessary bureaucratic procedures.

---

(In reply to Gilad Chaplik from comment #4)
> I thought the version is 3.4...

Sorry, I didn't get that?

---

The reported version is 3.3, so Omer's explanation applies (btw, did we fix the stack trace?)

---

(In reply to Gilad Chaplik from comment #4)
> Yuri,
>
> I thought the version is 3.4...
>
> Gilad.

I thought it was another Yuriy, sorry about that :)

---

Keep it open for the stack trace fix.

---

Not related to the issue, actually. The stack trace issue is fixed upstream: the stack trace is logged at debug level, while the exception message is logged at error level.
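As context for the last comment: below is a minimal sketch of the "message at error, stack trace at debug" logging pattern it describes, assuming an SLF4J-style logger. The class, interface, and method names are illustrative stand-ins, not the actual ovirt-engine code.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative sketch only -- not the actual ovirt-engine code.
// It shows the upstream fix described in the comment above: the
// exception message is logged at ERROR so operators see the failure,
// while the full stack trace is logged at DEBUG for root-cause analysis.
public class IsoPrefixExample {

    private static final Logger log =
            LoggerFactory.getLogger(IsoPrefixExample.class);

    /** Hypothetical stand-in for whatever call fetches the ISO prefix. */
    interface IsoPrefixSource {
        String fetchIsoPrefix(String storagePoolId);
    }

    public String getIsoPrefix(IsoPrefixSource source, String storagePoolId) {
        try {
            return source.fetchIsoPrefix(storagePoolId);
        } catch (RuntimeException e) {
            // Exception message at ERROR...
            log.error("Could not get ISO prefix for pool {}: {}",
                    storagePoolId, e.getMessage());
            // ...full stack trace only at DEBUG.
            log.debug("ISO prefix fetch failed", e);
            return "";
        }
    }
}
```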
Created attachment 897181 [details]
engine log

Description of problem:
After a power outage, restart of HA VMs failed with the error 'VdcBLLException: java.lang.NullPointerException (Failed with error ENGINE and code 5001)'.

Version-Release number of selected component (if applicable):
CentOS 6.5 on engine & hosts
FC SAN storage
ovirt-engine-3.3.2-1.el6.noarch
vdsm-4.13.2-1.el6.x86_64

How reproducible:
Don't know yet.

Actual results:
HA VMs in state 'down', no additional restart attempts.

Expected results:
HA VMs successfully restarted.

Additional info:
There was a double power outage. The first time, power came back after 00:19; at 00:23 the engine booted up and there were HA VM restart attempts, which failed with the error above. The second outage lasted from 00:26 to 10:09, and after the engine booted the second time only 2 of the 46 VMs were restarted, although all of them were configured as HA.

For example, the timeline for VM 'prod.gui' (host srv19 was already up when the engine booted):

```
2014-05-17 00:23:10,552 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-14) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host srv19.
2014-05-17 00:23:10,565 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-14) vm prod.gui running in db and not running in vds - add to rerun treatment. vds srv19
2014-05-17 00:23:10,909 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-14) [2989840c] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Highly Available VM prod.gui failed. It will be restarted automatically.
2014-05-17 00:23:10,911 INFO [org.ovirt.engine.core.bll.VdsEventListener] (DefaultQuartzScheduler_Worker-14) [2989840c] Highly Available VM went down. Attempting to restart. VM Name: prod.gui, VM Id:bbb7a605-d511-461d-99d2-c5a5bf8d9958
2014-05-17 00:23:11,569 WARN [org.ovirt.engine.core.bll.RunVmCommand] (DefaultQuartzScheduler_Worker-14) [2607ded5] Failed to run VM prod.gui: VdcBLLException: java.lang.NullPointerException (Failed with error ENGINE and code 5001)
2014-05-17 00:23:11,569 INFO [org.ovirt.engine.core.bll.RunVmCommand] (DefaultQuartzScheduler_Worker-14) [2607ded5] Lock freed to object EngineLock [exclusiveLocks= key: bbb7a605-d511-461d-99d2-c5a5bf8d9958 value: VM , sharedLocks= ]
2014-05-17 00:23:11,570 INFO [org.ovirt.engine.core.bll.RunVmCommand] (DefaultQuartzScheduler_Worker-14) [2607ded5] Trying to rerun VM prod.gui
2014-05-17 00:23:11,577 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-14) [2607ded5] Correlation ID: 2607ded5, Job ID: d10c372d-5a8d-455b-bf66-3c8d826520af, Call Stack: null, Custom Event ID: -1, Message: Failed to restart VM prod.gui on Host srv19
2014-05-17 00:25:31,312 INFO [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (DefaultQuartzScheduler_Worker-14) [2607ded5] START, CreateVmVDSCommand(HostName = srv12, HostId = eac8d084-1be0-4e8d-a589-385f71fe3afa, vmId=bbb7a605-d511-461d-99d2-c5a5bf8d9958, vm=VM [prod.gui]), log id: 7dff3269
```

After that came the second power outage. When power came back there were no further attempts to restart that VM; it stayed in state 'down' until it was started manually.

In srv19's vdsm.log, the corresponding task failed with the error 'Task is aborted: 'Unknown pool id, pool not connected' - code 309'.
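Comment #3 above mentions the 3.4 mechanism that retries failed HA VMs. As a rough illustration of that idea only (not the engine's actual implementation; every type and method below is hypothetical), a rerun loop would try the remaining candidate hosts and skip any host that is not actually connected to the storage pool, which is exactly the 'pool not connected' (code 309) failure mode seen here:

```java
import java.util.List;

// Conceptual sketch of an HA VM "rerun" loop -- NOT the actual
// ovirt-engine implementation. All types and methods are hypothetical,
// defined only to make the idea concrete.
public class HaRerunSketch {

    interface Host {
        String name();
        boolean isUp();
        // The trap in this bug: a host can report Up while not actually
        // being connected to the storage pool (VDSM error code 309).
        boolean isConnectedToStoragePool(String poolId);
        boolean tryRunVm(String vmId); // true if the VM started
    }

    /**
     * Try to restart a failed HA VM on each candidate host in turn,
     * skipping hosts that are down or not connected to the pool.
     */
    public static boolean rerunHaVm(String vmId, String poolId,
                                    List<Host> candidates) {
        for (Host host : candidates) {
            if (!host.isUp() || !host.isConnectedToStoragePool(poolId)) {
                continue; // avoid the 'pool not connected' failure mode
            }
            if (host.tryRunVm(vmId)) {
                return true; // VM restarted successfully
            }
            // On failure, fall through and rerun on the next host.
        }
        return false; // no host could run the VM; it stays Down
    }
}
```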