Bug 1459695 - Instance should not get stuck in Resuming state forever when qemu crashes
Status: NEW
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 11.0 (Ocata)
Hardware: x86_64 Linux
Priority: high Severity: high
Target Milestone: ---
Target Release: 11.0 (Ocata)
Assigned To: Eoghan Glynn
QA Contact: Joe H. Rahme
Keywords: Triaged, ZStream
Depends On:
Blocks:
Reported: 2017-06-07 16:59 EDT by Yuri Obshansky
Modified: 2018-01-04 21:45 EST
CC List: 11 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Yuri Obshansky 2017-06-07 16:59:10 EDT
Description of problem:
This issue is related to bug https://bugzilla.redhat.com/show_bug.cgi?id=1425516
"Bug 1425516 - Instance stuck resuming from suspend state during load test"
I expect the instance to go to the Error state when qemu crashes, rather than staying stuck in the Resuming state forever.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
Comment 1 Kashyap Chamarthy 2017-06-09 08:56:29 EDT
Okay, the root cause of the bug you linked seems to be the QEMU process crashing due to a SeaBIOS problem (from https://bugzilla.redhat.com/show_bug.cgi?id=1425516#c33):

[quote]

  a) Updating seabios to 7.4's seabios fixes it
  b) The errors are consistent with it being an SMM error 
     which we disabled in 7.4's seabios.

[/quote]

However, we still don't know the original *cause* of the error / crash in bug 1425516; we only know that updating SeaBIOS to its 7.4 version resolves the errors.

That said, the request in this bug sounds reasonable to me from a Nova perspective: instances should be placed in the ERROR state when the QEMU process crashes.
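
To make the requested behavior concrete, here is a minimal sketch in plain Python. It is not Nova code: the `Instance` class, the state names, and the `handle_hypervisor_exit` helper are all hypothetical, and it assumes the virt driver has some way to learn that the QEMU process died unexpectedly.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical state names for illustration; not Nova's actual constants.
RESUMING = "resuming"
SUSPENDED = "suspended"
ERROR = "error"


@dataclass
class Instance:
    """Simplified stand-in for an instance record (not Nova's object model)."""
    uuid: str
    vm_state: str
    task_state: Optional[str] = None


def handle_hypervisor_exit(instance: Instance, crashed: bool) -> Instance:
    """React to the hypervisor process going away.

    `crashed` is assumed to come from whatever signal the virt driver
    has that the QEMU process died unexpectedly.
    """
    if crashed and instance.task_state == RESUMING:
        # The resume can never complete, so clear the task and flag the
        # error instead of leaving the instance in "resuming" indefinitely.
        instance.vm_state = ERROR
        instance.task_state = None
    return instance
```

With this sketch, an instance that was mid-resume when the crash is reported ends up with `vm_state == "error"` and no pending task, while an instance whose hypervisor exit was not a crash is left untouched.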
Comment 2 Sahid Ferdjaoui 2017-07-25 05:16:11 EDT
Why did you set the severity/priority to high? It's true that Nova should correctly report that the QEMU process crashed, but I don't think it's something we can do easily in Nova, and it's probably not something we should consider as soon as possible.

That issue should be reported upstream first, and this BZ should unfortunately be closed as WONTFIX.
Comment 3 Matthew Booth 2017-07-25 06:00:01 EDT
(In reply to Sahid Ferdjaoui from comment #2)
> Why did you set the severity/priority to high? It's true that Nova should
> correctly report that the QEMU process crashed but I don't think it's
> something we can do easily in Nova and probably not something we should
> consider as soon as possible.
> 
> That issue should be reported upstream first and that BZ should be closed as
> WONTFIX unfortunately.

The issue is that Nova can't recover when it happens, which is a severe problem for anybody who hits it. If there's a way for Nova to recover automatically without operator intervention, we could drop the priority. The fact that it's difficult to fix doesn't mean it's not severe.

I agree we should also report the bug upstream. If you do that, could you please link the Launchpad bug in this bug? Please don't close this bug, though.
Comment 4 Sahid Ferdjaoui 2017-07-25 06:12:11 EDT
There is nothing severe on the Nova side. If QEMU crashes, I'm not sure what Nova could do to recover from it, just as I'm not sure what Nova could do to recover if the kernel panics.
