Bug 1731061 - Unable to spawn instances in RHOSP10

Product: Red Hat OpenStack
Component: openstack-nova
Version: 10.0 (Newton)
Hardware: x86_64
OS: Linux
Status: CLOSED NOTABUG
Severity: urgent
Priority: urgent
Reporter: Ganesh Kadam <gkadam>
Assignee: Artom Lifshitz <alifshit>
QA Contact: OSP DFG:Compute <osp-dfg-compute>
CC: akekane, alifshit, dasmith, dhill, eglynn, ipetrova, jhakimra, kchamart, lyarwood, mbooth, pkesavar, rcarrier, sbauza, sgordon, vromanso, ykulkarn
Target Milestone: ---
Target Release: ---
Type: Bug
Last Closed: 2019-08-06 12:08:30 UTC
Comment 3
Matthew Booth
2019-07-18 13:10:06 UTC
I'm in a remote session with the customer and converted the cirros qcow2 image to raw ... and I can now spawn instances on those "broken" hypervisors. So it looks like this is what happens on a working node:

2019-07-19 15:18:21.057 229504 DEBUG oslo_concurrency.lockutils [req-d03802f7-f4dd-4cd4-9f9e-c0941e1d9561 9dde5fd5e24f4f449bdd9885ca30ed02 d308ddb492c146ddacc97fdfd4e96681 - - -] Lock "a7cd97dc507872091424568d5ad4d7dd6bdec0b0" released by "nova.virt.libvirt.imagebackend.fetch_func_sync" :: held 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:282

2019-07-19 15:18:21.067 229504 DEBUG nova.virt.libvirt.storage.rbd_utils [req-d03802f7-f4dd-4cd4-9f9e-c0941e1d9561 9dde5fd5e24f4f449bdd9885ca30ed02 d308ddb492c146ddacc97fdfd4e96681 - - -] rbd image 85d27041-1425-4e67-88bc-829573c84345_disk does not exist __init__ /usr/lib/python2.7/site-packages/nova/virt/libvirt/storage/rbd_utils.py:77

2019-07-19 15:18:21.069 229504 DEBUG oslo_concurrency.processutils [req-d03802f7-f4dd-4cd4-9f9e-c0941e1d9561 9dde5fd5e24f4f449bdd9885ca30ed02 d308ddb492c146ddacc97fdfd4e96681 - - -] Running cmd (subprocess): rbd import --pool vms /var/lib/nova/instances/_base/a7cd97dc507872091424568d5ad4d7dd6bdec0b0 85d27041-1425-4e67-88bc-829573c84345_disk --image-format=2 --id openstack --conf /etc/ceph/ceph.conf execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:344

2019-07-19 15:18:21.464 229504 DEBUG oslo_concurrency.processutils [req-d03802f7-f4dd-4cd4-9f9e-c0941e1d9561 9dde5fd5e24f4f449bdd9885ca30ed02 d308ddb492c146ddacc97fdfd4e96681 - - -] CMD "rbd import --pool vms /var/lib/nova/instances/_base/a7cd97dc507872091424568d5ad4d7dd6bdec0b0 85d27041-1425-4e67-88bc-829573c84345_disk --image-format=2 --id openstack --conf /etc/ceph/ceph.conf" returned: 0 in 0.395s execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:374

... and these messages do not appear on a broken node.
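The log excerpt above shows the healthy path: Nova serializes the image fetch under a per-image lock (the `fetch_func_sync` message), then imports the cached base file into the `vms` RBD pool. A rough, simplified sketch of that flow, with locking done via `threading` rather than oslo.concurrency and a stubbed-out command runner (the function names here are illustrative, not Nova's actual internals):

```python
import os
import subprocess
import threading

# One lock per cached image, loosely mirroring oslo.concurrency's
# named locks seen in the log as "fetch_func_sync".
_fetch_locks = {}
_locks_guard = threading.Lock()


def _lock_for(cache_key):
    with _locks_guard:
        return _fetch_locks.setdefault(cache_key, threading.Lock())


def fetch_and_import(cache_key, base_dir, rbd_name,
                     pool="vms", run=subprocess.run):
    """Serialize the image fetch, then import the cached file into RBD."""
    cache_path = os.path.join(base_dir, cache_key)
    with _lock_for(cache_key):
        # The Glance download would happen here when cache_path is
        # missing; in this bug a stale cache_path + ".part" file
        # prevented that download from ever completing.
        if not os.path.exists(cache_path):
            raise RuntimeError("image %s not in cache" % cache_key)
    # Import the cached file into Ceph, as in the log's
    # "rbd import --pool vms ..." command.
    run(["rbd", "import", "--pool", pool, cache_path, rbd_name,
         "--image-format=2"], check=True)
    return cache_path
```

The injectable `run` parameter is only there so the flow can be exercised without a real Ceph cluster.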
The problem is actually easy to fix:

1) On each broken compute, delete /var/lib/nova/instances/_base/*.part
2) On each broken compute, restart openstack-nova-compute
3) Spawn new instances as if no issues were ever encountered

It looks like the Glance download is broken if the destination file is already present; nothing shows up in the logs in either normal or debug mode. This appears to be a bug: if instances are being spawned and the compute crashes without cleaning up the .part file, that cached image can never be used until the stale .part file is removed.

Created attachment 1600019 [details]
Reproducer script
Run this script as root and pass it two parameters: the image_id and the rados_id.
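Steps 1 and 2 of the workaround above can be sketched as a small Python helper; the default cache path is the one from the log, and the restart is left as a follow-up action since it must run on the compute node itself:

```python
import glob
import os


def clean_stale_parts(base_dir="/var/lib/nova/instances/_base"):
    """Remove stale .part files left behind by interrupted Glance downloads.

    After running this on each broken compute, restart
    openstack-nova-compute so new spawns re-fetch the image cleanly.
    """
    removed = []
    for path in glob.glob(os.path.join(base_dir, "*.part")):
        os.remove(path)
        removed.append(path)
    return removed
```

Completed cache files (without the `.part` suffix) are deliberately left untouched.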
Comment 43
Artom Lifshitz

I'm going to close this as NOTABUG because it's not a Nova bug, though I realize the case is ongoing and it might come back into the Compute DFG's court, in which case by all means feel free to re-open this bug.

I did propose a patch upstream [1] that adds more detailed Glance logging, in the hope that if something similar arises in the future, better logging will help us identify the issue more quickly. However, because of DFG priority and capacity constraints, I don't want to make a thing out of that patch (by tracking it with a BZ, for example).

[1] https://review.opendev.org/674791

Comment 44
David Hill

Wouldn't a better patch actually implement a timeout in the glanceclient call, so that it times out instead of blocking? This is not a Nova bug, but it is in the sense that Nova calls glanceclient and glanceclient never completes, for various reasons. I think Nova should be more resilient to this behavior.

(In reply to David Hill from comment #44)
> Wouldn't a better patch to actually implement a timeout in the glanceclient
> / glanceclient call so it actually times out instead of blocking ?
>
> This is not a nova bug but it is in the sense that nova calls glanceclient
> and glanceclient never completes for various reason. I think it should be
> more resilient to this behavior ...

I don't disagree, but this is more of a capacity/priority issue. Yes, it would be good to do as you say, but given that we have a long backlog of other things that directly cause problems for customers, those will get done first. I just wanted to set expectations right away: the improvement you mention is very unlikely to happen.

(In reply to Artom Lifshitz from comment #43)
> I'm going to close this as NOTABUG because it's not a Nova bug, though I
> realize the case is ongoing and it might come back into the Compute DFG's
> court, in which case by all means feel free to re-open this bug.
>
> I did propose a patch upstream [1] that adds some more detailed Glance
> logging in the hopes that if something similar arises in the future better
> logging would help us identify the issue quicker, but because of DFG
> priority and capacity constraints I don't want to make a thing out of that
> patch (by tracking it with a BZ, for example).
>
> [1] https://review.opendev.org/674791

Turns out we already have a glance debug option: https://docs.openstack.org/nova/latest/configuration/config.html#glance.debug

Why it's not included in the main debug option, I have no idea.
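For reference, the option linked above lives in the `[glance]` section of nova.conf on the compute node; a minimal fragment (option name taken from the linked Nova configuration docs):

```ini
[glance]
# Enable debug logging for the Glance image download path; this is
# separate from the service-wide [DEFAULT] debug option.
debug = True
```

Restart openstack-nova-compute after changing it for the setting to take effect.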