Red Hat Bugzilla – Bug 1459695
Instance should not get stuck in Resuming state forever when qemu crashes
Last modified: 2018-01-04 21:45:35 EST
Description of problem:
This issue is related to bug https://bugzilla.redhat.com/show_bug.cgi?id=1425516
"Bug 1425516 - Instance stuck resuming from suspend state during load test"
I expect the instance to go to the Error state when qemu crashes, rather than staying stuck in the Resuming state forever.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Okay, the root cause of the bug you linked seems to be the QEMU process crashing due to a SeaBIOS problem (from https://bugzilla.redhat.com/show_bug.cgi?id=1425516#c33):
a) Updating seabios to 7.4's seabios fixes it
b) The errors are consistent with it being an SMM error, which we disabled in 7.4's seabios.
However, we still don't know the original *cause* of the error / crash in bug 1425516; we only know that updating SeaBIOS to its 7.4 version resolves the errors.
That said, the request in this bug sounds reasonable to me from a Nova perspective: instances should be placed in ERROR state when the QEMU process crashes.
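To make the request concrete, here is a minimal sketch of the state transition being asked for. It is not Nova's actual code: `VmState`, `StopReason` and `state_after_hypervisor_stop` are hypothetical names invented for illustration, and how the crash would actually be detected is left out.

```python
from enum import Enum


class VmState(Enum):
    # Hypothetical, simplified set of instance states (assumption).
    ACTIVE = "active"
    SUSPENDED = "suspended"
    ERROR = "error"


class StopReason(Enum):
    # Hypothetical reasons for the hypervisor process going away (assumption).
    SHUTDOWN = "shutdown"
    CRASHED = "crashed"
    FAILED = "failed"


def state_after_hypervisor_stop(vm_state, task_state, reason):
    """Return the (vm_state, task_state) pair after the QEMU process stops."""
    abnormal = reason in (StopReason.CRASHED, StopReason.FAILED)
    if task_state == "resuming" and abnormal:
        # Clear the pending task so the instance does not stay in
        # "resuming" forever, and surface the failure as ERROR.
        return VmState.ERROR, None
    # Any other combination is left unchanged in this sketch.
    return vm_state, task_state
```

With this sketch, an instance that was suspended and mid-resume when QEMU crashed ends up as `(VmState.ERROR, None)`, while a clean shutdown leaves the states untouched.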
Why did you set the severity/priority to high? It's true that Nova should correctly report that the QEMU process crashed, but I don't think it's something we can do easily in Nova, and it is probably not something we should consider as soon as possible.
That issue should be reported upstream first, and that BZ should be closed as WONTFIX, unfortunately.
(In reply to Sahid Ferdjaoui from comment #2)
> Why did you set the severity/priority to high? It's true that Nova should
> correctly report that the QEMU process crashed but I don't think it's
> something we can do easily in Nova and probably not something we should
> consider as soon as possible.
> That issue should be reported upstream first and that BZ should be closed as
> WONTFIX unfortunately.
The issue is that Nova can't recover when it happens, which is a severe problem for anybody who hits it. If there's a way for Nova to recover automatically without operator intervention, we could drop the priority. The fact that it's difficult to fix doesn't mean it's not severe.
I agree we should also report the bug upstream. If you do that, could you please link the Launchpad bug in this bug? Please don't close this bug, though.
There is nothing severe on the Nova side. If QEMU crashes, I'm not sure what Nova could do to recover from it, just as I'm not sure what Nova could do to recover if the kernel panics.