Bug 1019401
| Field | Value |
|---|---|
| Summary | nova: failing to boot instance from image (create new volume) because of time out |
| Product | Red Hat OpenStack |
| Component | openstack-nova |
| Status | CLOSED CURRENTRELEASE |
| Severity | high |
| Priority | medium |
| Version | 4.0 |
| Target Release | 6.0 (Juno) |
| Hardware | x86_64 |
| OS | Linux |
| Whiteboard | storage |
| Keywords | Triaged, ZStream |
| Reporter | Dafna Ron \<dron\> |
| Assignee | Lee Yarwood \<lyarwood\> |
| QA Contact | yeylon \<yeylon\> |
| CC | andres, dron, eglynn, jwang, jwaterwo, lyarwood, ndipanov, sgordon, srevivo, sross, yeylon |
| Doc Type | Bug Fix |
| Type | Bug |
| Last Closed | 2015-11-02 14:42:51 UTC |
Description
Dafna Ron
2013-10-15 15:58:22 UTC
I'm adding two config options: poll_blk_dev_tries (default is 60) and poll_blk_dev_interval (default is 1). Currently running tests, but I thought I'd give an update. Sound good?

yes :) you're awesome!

It looks like upstream wants a much more complicated fix to this issue. We may have to do a downstream patch on this one for the moment and wait for rhos-5.0 for the more complicated fix (changing the Cinder API to actually provide status information).

Adding the link to the upstream discussion where the patch for the config options was first proposed, https://review.openstack.org/#/c/42876/6/ , for completeness. Special attention should be paid to the inline comments on the linked review.

I've conferred with some others on the downstream list, and it looks like it's been decided to just do the upstream fix and backport that. I'm beginning work on a patch to the upstream Cinder and Nova APIs.

This bug also affects booting an instance from a volume snapshot, because a bootable volume is created there and nova times out on it.

openstack-cinder-2013.2.1-5.el6ost.noarch
openstack-nova-2013.2.1-2.el6ost.noarch

```
2014-02-03 12:37:26.602 25255 ERROR nova.compute.manager [req-a1a3c7fe-6166-476f-bc3c-86fa816187b4 a74de68faf7a47ac8559c8460121da6e 0db2318a153e4a9486796627e35c5e1e] [instance: 5691dc36-983e-403b-a46f-90ad81e36fa5] Instance failed block device setup
2014-02-03 12:37:26.602 25255 TRACE nova.compute.manager [instance: 5691dc36-983e-403b-a46f-90ad81e36fa5] Traceback (most recent call last):
2014-02-03 12:37:26.602 25255 TRACE nova.compute.manager [instance: 5691dc36-983e-403b-a46f-90ad81e36fa5]   File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 1381, in _prep_block_device
2014-02-03 12:37:26.602 25255 TRACE nova.compute.manager [instance: 5691dc36-983e-403b-a46f-90ad81e36fa5]     self._await_block_device_map_created)
2014-02-03 12:37:26.602 25255 TRACE nova.compute.manager [instance: 5691dc36-983e-403b-a46f-90ad81e36fa5]   File "/usr/lib/python2.6/site-packages/nova/virt/block_device.py", line 283, in attach_block_devices
2014-02-03 12:37:26.602 25255 TRACE nova.compute.manager [instance: 5691dc36-983e-403b-a46f-90ad81e36fa5]     block_device_mapping)
2014-02-03 12:37:26.602 25255 TRACE nova.compute.manager [instance: 5691dc36-983e-403b-a46f-90ad81e36fa5]   File "/usr/lib/python2.6/site-packages/nova/virt/block_device.py", line 215, in attach
2014-02-03 12:37:26.602 25255 TRACE nova.compute.manager [instance: 5691dc36-983e-403b-a46f-90ad81e36fa5]     wait_func(context, vol['id'])
2014-02-03 12:37:26.602 25255 TRACE nova.compute.manager [instance: 5691dc36-983e-403b-a46f-90ad81e36fa5]   File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 901, in _await_block_device_map_created
2014-02-03 12:37:26.602 25255 TRACE nova.compute.manager [instance: 5691dc36-983e-403b-a46f-90ad81e36fa5]     attempts=attempts)
2014-02-03 12:37:26.602 25255 TRACE nova.compute.manager [instance: 5691dc36-983e-403b-a46f-90ad81e36fa5] VolumeNotCreated: Volume 72128f54-3099-4cd3-80d5-80e7146e8a1e did not finish being created even after we waited 66 seconds or 60 attempts.
```

I realized that I haven't actually written down what went on in the upstream discussion here. Basically, some people were in favor of a configurable timeout, some were in favor of revamping the API to automatically customize the timeout, and so on. However, it was pointed out that technically this is just a convenience operation that wraps two separate operations: creating a volume from an image or snapshot, and then booting an instance from the resulting volume.
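The timeout in the trace above comes from a bounded polling loop: nova repeatedly asks Cinder for the volume's status and gives up after a fixed number of attempts, so the effective timeout is roughly `retries * interval` seconds. A minimal sketch of that pattern (a simplification for illustration, not nova's actual code; `get_status` stands in for the Cinder volume-status call):

```python
import time


class VolumeNotCreated(Exception):
    """Raised when the volume never reaches 'available' within the budget."""


def await_volume_created(get_status, vol_id, retries=60, interval=1):
    """Poll get_status(vol_id) until it returns 'available'.

    Returns the number of attempts used; raises VolumeNotCreated after
    `retries` attempts spaced `interval` seconds apart.
    """
    for attempt in range(1, retries + 1):
        status = get_status(vol_id)
        if status == 'available':
            return attempt
        if status == 'error':
            raise VolumeNotCreated('Volume %s entered error state' % vol_id)
        time.sleep(interval)
    raise VolumeNotCreated(
        'Volume %s did not finish being created even after we waited '
        '%d seconds or %d attempts.' % (vol_id, retries * interval, retries))
```

With a large image being copied into the volume, a fixed `retries * interval` budget is easily exceeded, which is why the bug's resolution was to make both numbers configurable rather than to pick a new hard-coded value.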
For the moment, I think we should add a release note saying that under certain configurations users should avoid using the convenience method and instead perform the two operations separately. Hopefully we'll eventually reach a consensus upstream about how to fix this.

Following what Solly wrote: if this option currently fails even on small volumes (10GB), and no fix has been decided on for the near future, a release note is not enough :) We need to make it a tech preview and/or set the timeout to a larger value on Red Hat installations, plus document a workaround for increasing the timeout if needed.

Just for feedback: this issue is still very much alive and causing problems in production deployments with volume (SAN) backends, where volume sizes are larger than in development environments. It heavily affects the backup/snapshot restoration process, which easily runs into the timeout limits.

I hit this issue again on RHELOSP6.
1. Cinder backend is LVM
2. Glance image virtual size is 110G

(In reply to Andres Toomsalu from comment #12)
> Just for feedback: this issue is still very much alive and causing problems
> in production deployments with volume (SAN) backends - where volume sizes
> are larger than in development environments. Affects heavily backup/snaphot
> restoration process - which easily run into timeout limits.

(In reply to jwang from comment #15)
> I hit this issue again on RHELOSP6.
>
> 1. Cinder backend is LVM
> 2. Glance image virtual size is 110G

Hello Dafna, Andres, jwang, Jack, can you confirm which version of nova you are using in your environments?
I believe the following changes introduced configurables in Juno / RHEL OSP 6, and then in Icehouse / RHEL OSP 5 (via 2014.1.4), that can be used here:

[juno] Make the block device mapping retries configurable
https://review.openstack.org/#/c/102891/

[stable/icehouse] Make the block device mapping retries configurable
https://review.openstack.org/#/c/129276/

~~~
Make the block device mapping retries configurable

When booting instances passing in block-device and increasing the volume
size, instances can go into error state if the volume takes longer to
create than the hard-coded value (max_tries(180)/wait_between(1)) set in
nova/compute/manager.py:

    def _await_block_device_map_created(self, context, vol_id,
                                        max_tries=180, wait_between=1):

To fix this, max_tries/wait_between should be made configurable. Looking
through the different releases, Grizzly was 30, Havana was 60, Icehouse
is 180.

This change adds two configuration options:

a) `block_device_allocate_retries`, which can be set in nova.conf by the
   user to configure the number of block device mapping retries. It
   defaults to 60 and replaces the max_tries argument in the above method.

b) `block_device_allocate_retries_interval`, which allows the user to
   specify the time interval between consecutive retries. It defaults
   to 3 and replaces the wait_between argument in the above method.
~~~

Closing this out with CURRENTRELEASE given c#16; please reopen if this is still a problem.

We were using Icehouse.
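For reference, a deployment hitting this timeout with large volumes could raise the effective wait via the two options named in c#16 in nova.conf on the compute nodes. The values below are illustrative only, not recommendations; the effective timeout is roughly retries × interval seconds:

~~~
[DEFAULT]
# Number of times to poll for the volume to finish being created
# (upstream default: 60)
block_device_allocate_retries = 300
# Seconds to wait between polls (upstream default: 3);
# 300 retries * 3s = ~900s effective timeout
block_device_allocate_retries_interval = 3
~~~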