Bug 1731061

Summary: Unable to spawn instances in RHOSP10
Product: Red Hat OpenStack
Component: openstack-nova
Version: 10.0 (Newton)
Hardware: x86_64
OS: Linux
Status: CLOSED NOTABUG
Severity: urgent
Priority: urgent
Reporter: Ganesh Kadam <gkadam>
Assignee: Artom Lifshitz <alifshit>
QA Contact: OSP DFG:Compute <osp-dfg-compute>
CC: akekane, alifshit, dasmith, dhill, eglynn, ipetrova, jhakimra, kchamart, lyarwood, mbooth, pkesavar, rcarrier, sbauza, sgordon, vromanso, ykulkarn
Last Closed: 2019-08-06 12:08:30 UTC

Attachments:
Reproducer script

Comment 3 Matthew Booth 2019-07-18 13:10:06 UTC
Potentially related: https://review.opendev.org/#/c/667421/

Comment 5 David Hill 2019-07-19 14:47:53 UTC
I'm in a remote session with the customer and converted the cirros qcow2 image to raw ... and I can now spawn instances on those "broken" hypervisors.
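
For anyone who wants to try the same workaround, roughly something like this (paths and image name are examples, not the customer's actual values):

# Rough sketch of the qcow2 -> raw workaround; the raw file is then
# re-uploaded to Glance so new instances boot from a raw image.
import subprocess

SRC = "/tmp/cirros-0.4.0-x86_64-disk.qcow2"   # example source image
DST = "/tmp/cirros-0.4.0-x86_64-disk.raw"     # converted output

# Convert the qcow2 image to raw with qemu-img.
subprocess.check_call(
    ["qemu-img", "convert", "-f", "qcow2", "-O", "raw", SRC, DST])

# Upload the raw image back to Glance under a new (example) name.
subprocess.check_call(
    ["openstack", "image", "create", "cirros-raw",
     "--disk-format", "raw", "--container-format", "bare",
     "--file", DST])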

Comment 6 David Hill 2019-07-19 15:20:04 UTC
So it looks like this is what happens on a working node:

2019-07-19 15:18:21.057 229504 DEBUG oslo_concurrency.lockutils [req-d03802f7-f4dd-4cd4-9f9e-c0941e1d9561 9dde5fd5e24f4f449bdd9885ca30ed02 d308ddb492c146ddacc97fdfd4e96681 - - -] Lock "a7cd97dc507872091424568d5ad4d7dd6bdec0b0" released by "nova.virt.libvirt.imagebackend.fetch_func_sync" :: held 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:282
2019-07-19 15:18:21.067 229504 DEBUG nova.virt.libvirt.storage.rbd_utils [req-d03802f7-f4dd-4cd4-9f9e-c0941e1d9561 9dde5fd5e24f4f449bdd9885ca30ed02 d308ddb492c146ddacc97fdfd4e96681 - - -] rbd image 85d27041-1425-4e67-88bc-829573c84345_disk does not exist __init__ /usr/lib/python2.7/site-packages/nova/virt/libvirt/storage/rbd_utils.py:77
2019-07-19 15:18:21.069 229504 DEBUG oslo_concurrency.processutils [req-d03802f7-f4dd-4cd4-9f9e-c0941e1d9561 9dde5fd5e24f4f449bdd9885ca30ed02 d308ddb492c146ddacc97fdfd4e96681 - - -] Running cmd (subprocess): rbd import --pool vms /var/lib/nova/instances/_base/a7cd97dc507872091424568d5ad4d7dd6bdec0b0 85d27041-1425-4e67-88bc-829573c84345_disk --image-format=2 --id openstack --conf /etc/ceph/ceph.conf execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:344
2019-07-19 15:18:21.464 229504 DEBUG oslo_concurrency.processutils [req-d03802f7-f4dd-4cd4-9f9e-c0941e1d9561 9dde5fd5e24f4f449bdd9885ca30ed02 d308ddb492c146ddacc97fdfd4e96681 - - -] CMD "rbd import --pool vms /var/lib/nova/instances/_base/a7cd97dc507872091424568d5ad4d7dd6bdec0b0 85d27041-1425-4e67-88bc-829573c84345_disk --image-format=2 --id openstack --conf /etc/ceph/ceph.conf" returned: 0 in 0.395s execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:374

and these messages do not appear on a broken node.

Comment 7 David Hill 2019-07-19 15:34:08 UTC
The problem is actually easy to fix (a rough sketch of steps 1 and 2 follows below):

1) On each broken compute, delete /var/lib/nova/instances/_base/*.part
2) On each broken compute, restart openstack-nova-compute
3) Spawn new instances as if no issues were ever encountered
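
Assuming the default instances_path of /var/lib/nova/instances (adjust if the deployment uses something else), steps 1 and 2 boil down to roughly:

# Clean up stale partial downloads and restart nova-compute.
import glob
import os
import subprocess

# 1) Remove leftover .part files from the image cache.
for part in glob.glob("/var/lib/nova/instances/_base/*.part"):
    print("removing %s" % part)
    os.remove(part)

# 2) Restart the compute service so it starts with a clean cache.
subprocess.check_call(["systemctl", "restart", "openstack-nova-compute"])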

Comment 8 David Hill 2019-07-19 19:57:43 UTC
It looks like the glance download is broken if the destination file is already present ... no logs are shown in either normal or debug mode. This appears to be a bug: if an instance is being spawned and the compute crashes without cleaning up the .part file, that image can never be used again until the stale .part file is removed.
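
A quick way to spot computes stuck in this state is to look for old .part files in the image cache, e.g. (default instances_path assumed):

# List leftover .part files and their age; a stale file left behind by a
# crash will still show up here long after any in-progress download
# should have finished.
import glob
import os
import time

now = time.time()
for part in glob.glob("/var/lib/nova/instances/_base/*.part"):
    age_min = (now - os.path.getmtime(part)) / 60
    print("%s (last modified %.0f minutes ago)" % (part, age_min))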

Comment 34 Abhishek Kekane 2019-08-02 13:13:29 UTC
Created attachment 1600019 [details]
Reproducer script

You need to run this script as the root user and pass it two input parameters: the image_id and the rados_id.
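
To give an idea of what it exercises (this is NOT the attached script, just a rough sketch pieced together from the debug logs in comment 6, with placeholder credentials): it downloads the image data from Glance to a local .part file and then imports it into the vms pool with the given rados_id, the same sequence nova-compute runs.

#!/usr/bin/env python
# Rough sketch only -- not the attached reproducer.  Auth URL and
# credentials are placeholders; nova's real cache path uses a hashed file
# name, /tmp is used here for simplicity.
import subprocess
import sys

from glanceclient import Client
from keystoneauth1 import session
from keystoneauth1.identity import v3

image_id, rados_id = sys.argv[1], sys.argv[2]

auth = v3.Password(auth_url="http://keystone:5000/v3",    # placeholder
                   username="admin", password="secret",   # placeholder
                   project_name="admin",
                   user_domain_name="Default",
                   project_domain_name="Default")
glance = Client("2", session=session.Session(auth=auth))

# Download the image data, chunk by chunk, to a .part destination.
dest = "/tmp/%s.part" % image_id
with open(dest, "wb") as f:
    for chunk in glance.images.data(image_id):
        f.write(chunk)

# Import the downloaded file into the vms pool, mirroring the "rbd import"
# command visible in the nova-compute debug logs above.
subprocess.check_call(["rbd", "import", "--pool", "vms", dest,
                       "%s_disk" % image_id, "--image-format=2",
                       "--id", rados_id, "--conf", "/etc/ceph/ceph.conf"])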

Comment 43 Artom Lifshitz 2019-08-06 12:08:30 UTC
I'm going to close this as NOTABUG because it's not a Nova bug, though I realize the case is ongoing and it might come back into the Compute DFG's court, in which case by all means feel free to re-open this bug.

I did propose a patch upstream [1] that adds some more detailed Glance logging, in the hope that if something similar arises in the future, better logging will help us identify the issue more quickly. But because of DFG priority and capacity constraints I don't want to make a thing out of that patch (by tracking it with a BZ, for example).

[1] https://review.opendev.org/674791

Comment 44 David Hill 2019-08-06 13:46:19 UTC
Wouldn't a better patch be to actually implement a timeout in the glanceclient call so it actually times out instead of blocking?

This is not a nova bug per se, but it is in the sense that nova calls glanceclient and glanceclient never completes, for various reasons. I think nova should be more resilient to this behavior ...
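
For the record, roughly what I have in mind -- the session-based glance client already takes a request timeout, so something along these lines (values made up, not a proposed nova patch) would turn an indefinite hang into an exception that nova could handle:

# Illustration of the idea only.  A timeout on the keystoneauth session is
# applied to the underlying HTTP requests, so a stuck image download
# raises a timeout error instead of blocking forever.
from glanceclient import Client
from keystoneauth1 import session
from keystoneauth1.identity import v3

auth = v3.Password(auth_url="http://keystone:5000/v3",    # placeholder
                   username="admin", password="secret",   # placeholder
                   project_name="admin",
                   user_domain_name="Default",
                   project_domain_name="Default")

sess = session.Session(auth=auth, timeout=60)   # example value
glance = Client("2", session=sess)

# Any call that talks to the glance API, e.g. iterating
# glance.images.data(image_id), now fails with a timeout error rather
# than hanging if the server stops responding.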

Comment 45 Artom Lifshitz 2019-08-07 14:08:12 UTC
(In reply to David Hill from comment #44)
> Wouldn't a better patch be to actually implement a timeout in the
> glanceclient call so it actually times out instead of blocking?
> 
> This is not a nova bug per se, but it is in the sense that nova calls
> glanceclient and glanceclient never completes, for various reasons. I
> think nova should be more resilient to this behavior ...

I don't disagree, but this is more of a capacity/priority issue. Yes, it'd be good to do as you say, but given that we have a long backlog of other things that directly cause problems for customers, those are going to get done first. I just wanted to set expectations right away: the improvement you mention is very unlikely to happen.

Comment 46 Artom Lifshitz 2019-08-07 14:09:06 UTC
(In reply to Artom Lifshitz from comment #43)
> I'm going to close this as NOTABUG because it's not a Nova bug, though I
> realize the case is ongoing and it might come back into the Compute DFG's
> court, in which case by all means feel free to re-open this bug.
> 
> I did propose a patch upstream [1] that adds some more detailed Glance
> logging, in the hope that if something similar arises in the future,
> better logging will help us identify the issue more quickly. But because
> of DFG priority and capacity constraints I don't want to make a thing
> out of that patch (by tracking it with a BZ, for example).
> 
> [1] https://review.opendev.org/674791

Turns out we already have a glance debug option: https://docs.openstack.org/nova/latest/configuration/config.html#glance.debug

Why it's not included in the main debug option, I have no idea.
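
For anyone who needs it on a deployed system, enabling it is just a nova.conf setting on the compute node (followed by a restart of openstack-nova-compute), along the lines of:

[glance]
debug = true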