Bug 1731061

Summary: Unable to spawn instances in RHOSP10
Product: Red Hat OpenStack
Component: openstack-nova
Version: 10.0 (Newton)
Hardware: x86_64
OS: Linux
Status: CLOSED NOTABUG
Severity: urgent
Priority: urgent
Reporter: Ganesh Kadam <gkadam>
Assignee: Artom Lifshitz <alifshit>
QA Contact: OSP DFG:Compute <osp-dfg-compute>
CC: akekane, alifshit, dasmith, dhill, eglynn, ipetrova, jhakimra, kchamart, lyarwood, mbooth, pkesavar, rcarrier, sbauza, sgordon, vromanso, ykulkarn
Last Closed: 2019-08-06 12:08:30 UTC

Attachments:
Reproducer script

Comment 3 Matthew Booth 2019-07-18 13:10:06 UTC
Potentially related: https://review.opendev.org/#/c/667421/

Comment 5 David Hill 2019-07-19 14:47:53 UTC
I'm in a remote session with the customer and converted the cirros qcow2 image to raw ... and I can now spawn instances on those "broken" hypervisors.
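
For anyone who wants to try the same workaround, roughly something like this (paths and image name are examples, not the customer's actual values):

# Rough sketch of the qcow2 -> raw workaround; the raw file is then
# re-uploaded to Glance so new instances boot from a raw image.
import subprocess

SRC = "/tmp/cirros-0.4.0-x86_64-disk.qcow2"   # example source image
DST = "/tmp/cirros-0.4.0-x86_64-disk.raw"     # converted output

# Convert the qcow2 image to raw with qemu-img.
subprocess.check_call(
    ["qemu-img", "convert", "-f", "qcow2", "-O", "raw", SRC, DST])

# Upload the raw image back to Glance under a new (example) name.
subprocess.check_call(
    ["openstack", "image", "create", "cirros-raw",
     "--disk-format", "raw", "--container-format", "bare",
     "--file", DST])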

Comment 6 David Hill 2019-07-19 15:20:04 UTC
So it looks like this is what happens on a working node:

2019-07-19 15:18:21.057 229504 DEBUG oslo_concurrency.lockutils [req-d03802f7-f4dd-4cd4-9f9e-c0941e1d9561 9dde5fd5e24f4f449bdd9885ca30ed02 d308ddb492c146ddacc97fdfd4e96681 - - -] Lock "a7cd97dc507872091424568d5ad4d7dd6bdec0b0" released by "nova.virt.libvirt.imagebackend.fetch_func_sync" :: held 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:282
2019-07-19 15:18:21.067 229504 DEBUG nova.virt.libvirt.storage.rbd_utils [req-d03802f7-f4dd-4cd4-9f9e-c0941e1d9561 9dde5fd5e24f4f449bdd9885ca30ed02 d308ddb492c146ddacc97fdfd4e96681 - - -] rbd image 85d27041-1425-4e67-88bc-829573c84345_disk does not exist __init__ /usr/lib/python2.7/site-packages/nova/virt/libvirt/storage/rbd_utils.py:77
2019-07-19 15:18:21.069 229504 DEBUG oslo_concurrency.processutils [req-d03802f7-f4dd-4cd4-9f9e-c0941e1d9561 9dde5fd5e24f4f449bdd9885ca30ed02 d308ddb492c146ddacc97fdfd4e96681 - - -] Running cmd (subprocess): rbd import --pool vms /var/lib/nova/instances/_base/a7cd97dc507872091424568d5ad4d7dd6bdec0b0 85d27041-1425-4e67-88bc-829573c84345_disk --image-format=2 --id openstack --conf /etc/ceph/ceph.conf execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:344
2019-07-19 15:18:21.464 229504 DEBUG oslo_concurrency.processutils [req-d03802f7-f4dd-4cd4-9f9e-c0941e1d9561 9dde5fd5e24f4f449bdd9885ca30ed02 d308ddb492c146ddacc97fdfd4e96681 - - -] CMD "rbd import --pool vms /var/lib/nova/instances/_base/a7cd97dc507872091424568d5ad4d7dd6bdec0b0 85d27041-1425-4e67-88bc-829573c84345_disk --image-format=2 --id openstack --conf /etc/ceph/ceph.conf" returned: 0 in 0.395s execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:374

and these messages do not appear on a broken node.

Comment 7 David Hill 2019-07-19 15:34:08 UTC
The problem is actually easy to fix (a rough sketch of steps 1 and 2 follows below):

1) On each broken compute, delete /var/lib/nova/instances/_base/*.part
2) On each broken compute, restart openstack-nova-compute
3) Spawn new instances as if no issues were ever encountered
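
Assuming the default instances_path of /var/lib/nova/instances (adjust if the deployment uses something else), steps 1 and 2 boil down to roughly:

# Clean up stale partial downloads and restart nova-compute.
import glob
import os
import subprocess

# 1) Remove leftover .part files from the image cache.
for part in glob.glob("/var/lib/nova/instances/_base/*.part"):
    print("removing %s" % part)
    os.remove(part)

# 2) Restart the compute service so it starts with a clean cache.
subprocess.check_call(["systemctl", "restart", "openstack-nova-compute"])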

Comment 8 David Hill 2019-07-19 19:57:43 UTC
It looks like the glance download is broken if the destination file is already present ... no logs are shown in either normal or debug mode. This appears to be a bug: if an instance is being spawned and the compute crashes without cleaning up the .part file, that image can never be used again until the stale .part file is removed.
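
A quick way to spot computes stuck in this state is to look for old .part files in the image cache, e.g. (default instances_path assumed):

# List leftover .part files and their age; a stale file left behind by a
# crash will still show up here long after any in-progress download
# should have finished.
import glob
import os
import time

now = time.time()
for part in glob.glob("/var/lib/nova/instances/_base/*.part"):
    age_min = (now - os.path.getmtime(part)) / 60
    print("%s (last modified %.0f minutes ago)" % (part, age_min))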

Comment 34 Abhishek Kekane 2019-08-02 13:13:29 UTC
Created attachment 1600019 [details]
Reproducer script

You need to run this script as the root user and pass it two input parameters: the image_id and the rados_id.
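
To give an idea of what it exercises (this is NOT the attached script, just a rough sketch pieced together from the debug logs in comment 6, with placeholder credentials): it downloads the image data from Glance to a local .part file and then imports it into the vms pool with the given rados_id, the same sequence nova-compute runs.

#!/usr/bin/env python
# Rough sketch only -- not the attached reproducer.  Auth URL and
# credentials are placeholders; nova's real cache path uses a hashed file
# name, /tmp is used here for simplicity.
import subprocess
import sys

from glanceclient import Client
from keystoneauth1 import session
from keystoneauth1.identity import v3

image_id, rados_id = sys.argv[1], sys.argv[2]

auth = v3.Password(auth_url="http://keystone:5000/v3",    # placeholder
                   username="admin", password="secret",   # placeholder
                   project_name="admin",
                   user_domain_name="Default",
                   project_domain_name="Default")
glance = Client("2", session=session.Session(auth=auth))

# Download the image data, chunk by chunk, to a .part destination.
dest = "/tmp/%s.part" % image_id
with open(dest, "wb") as f:
    for chunk in glance.images.data(image_id):
        f.write(chunk)

# Import the downloaded file into the vms pool, mirroring the "rbd import"
# command visible in the nova-compute debug logs above.
subprocess.check_call(["rbd", "import", "--pool", "vms", dest,
                       "%s_disk" % image_id, "--image-format=2",
                       "--id", rados_id, "--conf", "/etc/ceph/ceph.conf"])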

Comment 43 Artom Lifshitz 2019-08-06 12:08:30 UTC
I'm going to close this as NOTABUG because it's not a Nova bug, though I realize the case is ongoing and it might come back into the Compute DFG's court, in which case by all means feel free to re-open this bug.

I did propose a patch upstream [1] that adds some more detailed Glance logging, in the hope that if something similar arises in the future, better logging will help us identify the issue more quickly. But because of DFG priority and capacity constraints I don't want to make a thing out of that patch (by tracking it with a BZ, for example).

[1] https://review.opendev.org/674791

Comment 44 David Hill 2019-08-06 13:46:19 UTC
Wouldn't a better patch be to actually implement a timeout in the glanceclient call so it actually times out instead of blocking?

This is not a nova bug per se, but it is in the sense that nova calls glanceclient and glanceclient never completes, for various reasons. I think nova should be more resilient to this behavior ...
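
For the record, roughly what I have in mind -- the session-based glance client already takes a request timeout, so something along these lines (values made up, not a proposed nova patch) would turn an indefinite hang into an exception that nova could handle:

# Illustration of the idea only.  A timeout on the keystoneauth session is
# applied to the underlying HTTP requests, so a stuck image download
# raises a timeout error instead of blocking forever.
from glanceclient import Client
from keystoneauth1 import session
from keystoneauth1.identity import v3

auth = v3.Password(auth_url="http://keystone:5000/v3",    # placeholder
                   username="admin", password="secret",   # placeholder
                   project_name="admin",
                   user_domain_name="Default",
                   project_domain_name="Default")

sess = session.Session(auth=auth, timeout=60)   # example value
glance = Client("2", session=sess)

# Any call that talks to the glance API, e.g. iterating
# glance.images.data(image_id), now fails with a timeout error rather
# than hanging if the server stops responding.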

Comment 45 Artom Lifshitz 2019-08-07 14:08:12 UTC
(In reply to David Hill from comment #44)
> Wouldn't a better patch be to actually implement a timeout in the
> glanceclient call so it actually times out instead of blocking?
> 
> This is not a nova bug per se, but it is in the sense that nova calls
> glanceclient and glanceclient never completes, for various reasons. I
> think nova should be more resilient to this behavior ...

I don't disagree, but this is more of a capacity/priority issue. Yes, it'd be good to do as you say, but given that we have a long backlog of other things that directly cause problems for customers, those are going to get done first. I just wanted to set expectations right away: the improvement you mention is very unlikely to happen.

Comment 46 Artom Lifshitz 2019-08-07 14:09:06 UTC
(In reply to Artom Lifshitz from comment #43)
> I'm going to close this as NOTABUG because it's not a Nova bug, though I
> realize the case is ongoing and it might come back into the Compute DFG's
> court, in which case by all means feel free to re-open this bug.
> 
> I did propose a patch upstream [1] that adds some more detailed Glance
> logging, in the hope that if something similar arises in the future,
> better logging will help us identify the issue more quickly. But because
> of DFG priority and capacity constraints I don't want to make a thing
> out of that patch (by tracking it with a BZ, for example).
> 
> [1] https://review.opendev.org/674791

Turns out we already have a glance debug option: https://docs.openstack.org/nova/latest/configuration/config.html#glance.debug

Why it's not included in the main debug option, I have no idea.
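
For anyone who needs it on a deployed system, enabling it is just a nova.conf setting on the compute node (followed by a restart of openstack-nova-compute), along the lines of:

[glance]
debug = true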