Bug 1459695

Summary:	Instance should not get stuck in Resuming state forever when qemu crashes
Product:	Red Hat OpenStack	Reporter:	Yuri Obshansky <yobshans>
Component:	openstack-nova	Assignee:	OSP DFG:Compute <osp-dfg-compute>
Status:	CLOSED EOL	QA Contact:	OSP DFG:Compute <osp-dfg-compute>
Severity:	high	Docs Contact:
Priority:	high
Version:	11.0 (Ocata)	CC:	berrange, dasmith, eglynn, kchamart, mbooth, sbauza, sferdjao, sgordon, srevivo, vromanso
Target Milestone:	---	Keywords:	Triaged, ZStream
Target Release:	11.0 (Ocata)
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-06-22 12:40:54 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Yuri Obshansky 2017-06-07 20:59:10 UTC

Description of problem:
This issue is related to bug https://bugzilla.redhat.com/show_bug.cgi?id=1425516
"Bug 1425516 - Instance stuck resuming from suspend state during load test"
I expect the instance to go to the Error state when qemu crashes, rather than stay stuck in the Resuming state forever.

Comment 1 Kashyap Chamarthy 2017-06-09 12:56:29 UTC

Okay, the root cause of the bug you linked seems to be the QEMU process crashing due to a SeaBIOS problem (from https://bugzilla.redhat.com/show_bug.cgi?id=1425516#c33):

[quote]

  a) Updating seabios to 7.4's seabios fixes it
  b) The errors are consistent with it being an SMM error 
     which we disabled in 7.4's seabios.

[/quote]

However, we still don't know the original *cause* of the error / crash (in bug 1425516); we only know that updating SeaBIOS to its 7.4 version resolves the errors.

That said, the request from this bug sounds reasonable to me from a Nova perspective: instances should be placed in ERROR state when the QEMU process crashes.
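
To make the requested behaviour concrete, here is a minimal sketch in plain Python. It is a toy model, not Nova's code: the Instance class, the HypervisorCrashed exception, and the driver_resume callable are all hypothetical names made up for illustration. It also assumes the QEMU crash is surfaced to the caller as an exception; detecting the crash in the first place is a separate problem that this sketch does not address.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# State names mirror the ones used in this report; this is a toy model,
# not Nova's implementation.
SUSPENDED, ACTIVE, ERROR = "suspended", "active", "error"
RESUMING = "resuming"


class HypervisorCrashed(Exception):
    """Hypothetical signal that the QEMU process exited mid-operation."""


@dataclass
class Instance:
    uuid: str
    vm_state: str = SUSPENDED
    task_state: Optional[str] = None


def resume_instance(instance: Instance,
                    driver_resume: Callable[[Instance], None]) -> Instance:
    """Resume an instance without leaving it in RESUMING on a crash."""
    instance.task_state = RESUMING
    try:
        driver_resume(instance)  # hypothetical stand-in for the virt driver call
    except HypervisorCrashed:
        # The hypervisor process is gone: report ERROR instead of hanging.
        instance.vm_state = ERROR
    else:
        instance.vm_state = ACTIVE
    finally:
        # Always clear the transient state so the instance is never stuck.
        instance.task_state = None
    return instance
```

The point of the try/finally shape is that the transient "resuming" marker is cleared on every path, so an operator always sees either an active instance or one in ERROR that can be acted on.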

Comment 2 Sahid Ferdjaoui 2017-07-25 09:16:11 UTC

Why did you set the severity/priority to high? It's true that Nova should correctly report that the QEMU process crashed, but I don't think it's something we can do easily in Nova, and it's probably not something we should consider as soon as possible.

That issue should be reported upstream first, and unfortunately this BZ should be closed as WONTFIX.

Comment 3 Matthew Booth 2017-07-25 10:00:01 UTC

(In reply to Sahid Ferdjaoui from comment #2)
> Why did you set the severity/priority to high? It's true that Nova should
> correctly report that the QEMU process crashed but I don't think it's
> something we can do easily in Nova and probably not something we should
> consider as soon as possible.
> 
> That issue should be reported upstream first and that BZ should be closed as
> WONTFIX unfortunately.

The issue is that Nova can't recover when it happens, which is a severe problem for anybody who hits it. If there's a way for Nova to recover automatically without operator intervention, we could drop the priority. The fact that it's difficult to fix doesn't mean it's not severe.

I agree we should also report the bug upstream. If you do that, could you please link the Launchpad bug in this bug? Please don't close this bug, though.

Comment 4 Sahid Ferdjaoui 2017-07-25 10:12:11 UTC

There is nothing severe on the Nova side. If QEMU crashes, I'm not sure what Nova could do to recover from it, just as I'm not sure what Nova could do to recover from a kernel panic.

Comment 5 Scott Lewis 2018-06-22 12:40:54 UTC

OSP11 is now retired, see details at https://access.redhat.com/errata/product/191/ver=11/rhel---7/x86_64/RHBA-2018:1828