Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1236183

Summary: getting "Block Device Mapping is Invalid" error when doing "boot from image (creates a new volume)"
Product: Red Hat OpenStack
Reporter: Dave Maley <dmaley>
Component: openstack-cinder
Assignee: Eric Harney <eharney>
Status: CLOSED NOTABUG
QA Contact: nlevinki <nlevinki>
Severity: high
Docs Contact:
Priority: high
Version: 5.0 (RHEL 7)
CC: ddomingo, dmaley, eharney, fpercoco, sgotliv, yeylon
Target Milestone: ---
Keywords: ZStream
Target Release: 5.0 (RHEL 7)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Release Note
Doc Text:
When the Block Storage service creates volumes from images, it downloads images from the Image service into an image conversion directory. This directory is defined by the 'image_conversion_dir' option in /etc/cinder/cinder.conf (under the [DEFAULT] section). By default, 'image_conversion_dir' is set to /var/lib/cinder/conversion.

If the image conversion directory runs out of space (typically, when multiple volumes are created from large images simultaneously), any attempts to create volumes from images will fail. Further, any attempts to launch instances which would require the creation of volumes from images will fail as well. These failures will continue until the image conversion directory has enough free space. As such, you should ensure that the image conversion directory has enough space for the typical number of volumes that users simultaneously create from images.

If you need to define a non-default image conversion directory, run:

# openstack-config --set /etc/cinder/cinder.conf DEFAULT image_conversion_dir <NEWDIR>

Replace <NEWDIR> with the new directory. Afterwards, restart the Block Storage service to apply the new setting:

# openstack-service restart cinder
Story Points: ---
Clone Of:
: 1244555 (view as bug list)
Environment:
Last Closed: 2015-07-19 00:12:34 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1244555

Description Dave Maley 2015-06-26 18:41:12 UTC
Description of problem:
Customer is seeing "Block Device Mapping is Invalid" errors when attempting to launch instances via the dashboard.  This is being seen intermittently in both their lab and production environments.  The problem seems to happen more frequently for images that require larger boot volumes.


Version-Release number of selected component (if applicable):
openstack-cinder-2014.1.4-1.el7ost.noarch


How reproducible:
intermittent


Steps to Reproduce:
1. launch instance -> boot from image (creates new volume)
2. select image
3. launch

Actual results:
intermittently get "Block Device Mapping is Invalid" error


Expected results:
instance launches successfully

Additional info:

Comment 2 Dave Maley 2015-06-26 18:52:41 UTC
looking at the ctrl1 logs there are 6 cinder-volume ERROR occurrences found in:
sos_commands/cluster/crm_report/xlabostkctrl1.healthehostt.com/messages

The first 4 are corrupted image errors, the last 3 of which are the same image:
Jun 17 10:32:46: Failed to copy image c993bb22-57c3-46ba-884e-4cb8621f817b to volume: 0bb2d231-22a2-40d8-8ebe-fd04b35db6a6, error: [Errno 32] Corrupted image. Checksum was 62fe8ec46b25d12224af9b2955b5b406 expected fd7b641fa45895fad682edd7984cf786

Jun 17 10:56:04: Failed to copy image ee63031b-0d91-4c02-93b2-6d2a60f4d155 to volume: 5298f9a8-a794-4573-880b-161dffd639fe, error: [Errno 32] Corrupted image. Checksum was a35a81dd4eeacf6a8abc9844403ba748 expected 3fc567c135ba42deb7f6c780921b49f2

Jun 17 11:10:27: Failed to copy image ee63031b-0d91-4c02-93b2-6d2a60f4d155 to volume: f7ee7ab1-c2db-45ce-8fe3-17f4997836c5, error: [Errno 32] Corrupted image. Checksum was 242d5bbd8f22e848e7fe5642b7a081f0 expected 3fc567c135ba42deb7f6c780921b49f2

Jun 17 11:30:15: Failed to copy image ee63031b-0d91-4c02-93b2-6d2a60f4d155 to volume: e133edc4-8e22-4453-9eb8-b3ee1fa9da56, error: [Errno 32] Corrupted image. Checksum was 0638a3058adb23ef5f8b092954cbbd20 expected 3fc567c135ba42deb7f6c780921b49f2


The last 2 may be related to bug 1234661:
Jun 17 15:20:37: error: Image 75bd1f60-f158-456b-983e-8dc992f5a5c5 is unacceptable: Size is 22GB and doesn't fit in a volume of size 3GB

Jun 18 12:08:43: error: Image 75bd1f60-f158-456b-983e-8dc992f5a5c5 is unacceptable: Size is 22GB and doesn't fit in a volume of size 3GB

Comment 3 Eric Harney 2015-06-26 21:01:45 UTC
Three of the failures show that they are trying to download image
ee63031b-0d91... and getting three different (incorrect) checksums after
copying it.  Sounds worrying.

The first thing we need to do there is rule out some basic issues... can
we somehow just download the image from Glance manually and see what the
checksum is?  (Not sure of the easiest way to do this.)

And since I think they are using the file store, also just md5sum it
directly in /var/lib/glance/images/ (?) to see if it matches what Glance
thinks the checksum should be?
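A minimal sketch of that verification step follows. The image ID and expected checksum are taken from this bug; the /var/lib/glance/images path assumes the file store backend Eric mentions. Since a real controller isn't available here, the runnable part checksums a stand-in file instead, so the demo is expected to report a mismatch:

```shell
#!/bin/sh
# Sketch: verify an image's on-disk MD5 against the checksum Glance
# has recorded. IMAGE_ID and EXPECTED come from this bug report.
IMAGE_ID=ee63031b-0d91-4c02-93b2-6d2a60f4d155
EXPECTED=3fc567c135ba42deb7f6c780921b49f2

# On a real controller with the file store backend, you would run:
#   actual=$(md5sum "/var/lib/glance/images/$IMAGE_ID" | awk '{print $1}')
# For a self-contained demonstration, checksum a stand-in file instead
# (this will, of course, report a mismatch -- it is not the real image):
printf 'stand-in image data' > /tmp/demo_image
actual=$(md5sum /tmp/demo_image | awk '{print $1}')

if [ "$actual" = "$EXPECTED" ]; then
    echo "checksum matches"
else
    echo "checksum MISMATCH: got $actual, expected $EXPECTED"
fi
```

The same `actual` value can also be compared against the checksum column of `glance image-show <IMAGE_ID>`.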

Comment 4 Sergey Gotliv 2015-06-28 13:39:33 UTC
Dave,

Please, see Eric's questions in the previous comment.

Comment 5 Flavio Percoco 2015-06-28 20:49:24 UTC
I took a look at the logs from one of the nodes and there seem to be quite a few "Broken Pipe" errors in glance.api. These errors are consistent with the volume creation requests from Cinder's side, which raises the question of whether there's something happening during these requests.

In addition to the above, the mismatched checksums are not always the same. There are cases where the downloaded image data's checksum differs across downloads of the same image (judging by the expected checksum).

2015-06-17 11:19:23.752 23484 ERROR oslo.messaging._drivers.common [req-71aa7f96-589a-43dc-b16a-21b2273d5e56 e998e2dabe8e4d1eb2a6b778b4422c01 2c8e7a30a0cf45d49eb6142c15d1f710 - - -] Returning exception Failed to copy image to volume: [Errno 32] Corrupted image. Checksum was 262a16cc59a1c6e37ad7e0687597ddb8 expected 3fc567c135ba42deb7f6c780921b49f2 to caller
2015-06-17 11:43:03.917 23484 ERROR oslo.messaging._drivers.common [req-dae01d75-c0f0-48a1-8295-4f6211852694 e998e2dabe8e4d1eb2a6b778b4422c01 2c8e7a30a0cf45d49eb6142c15d1f710 - - -] Returning exception Failed to copy image to volume: [Errno 32] Corrupted image. Checksum was 90267d13afe96336ad1d074bbeda9268 expected 3fc567c135ba42deb7f6c780921b49f2 to caller


The above, and the logs below, lead me to think that the checksum of the image in Glance is probably fine, although I still recommend following Eric's suggestions in c#3. To download the image you can:

$ glance image-download ee63031b-0d91-4c02-93b2-6d2a60f4d155 > downloaded.qcow2
$ glance image-show ee63031b-0d91-4c02-93b2-6d2a60f4d155 # Shows the checksum stored in the database

Alternatively, it may be easier to go directly to the image directory used by Glance and verify that the checksum of image ee63031b-0d91-4c02-93b2-6d2a60f4d155 matches the one stored in the database (3fc567c135ba42deb7f6c780921b49f2).


A case of "matched" broken pipe from Glance with Cinder's failure:

Glance API:

glance/api.log:2015-06-17 11:18:49.897 23297 INFO glance.wsgi.server [4b1bc2d5-1247-4538-9d2f-3d2d724b3d92 e998e2dabe8e4d1eb2a6b778b4422c01 2c8e7a30a0cf45d49eb6142c15d1f710 - - -] Traceback (most recent call last):
--
glance/api.log-    _writelines(towrite)
glance/api.log-  File "/usr/lib64/python2.7/socket.py", line 334, in writelines
glance/api.log-    self.flush()
glance/api.log-  File "/usr/lib64/python2.7/socket.py", line 303, in flush
glance/api.log-    self._sock.sendall(view[write_offset:write_offset+buffer_size])
glance/api.log-  File "/usr/lib/python2.7/site-packages/eventlet/greenio.py", line 309, in sendall
glance/api.log-    tail = self.send(data, flags)
glance/api.log-  File "/usr/lib/python2.7/site-packages/eventlet/greenio.py", line 295, in send
glance/api.log-    total_sent += fd.send(data[total_sent:], flags)
glance/api.log-error: [Errno 32] Broken pipe
glance/api.log:2015-06-17 11:18:49.898 23297 INFO glance.wsgi.server [4b1bc2d5-1247-4538-9d2f-3d2d724b3d92 e998e2dabe8e4d1eb2a6b778b4422c01 2c8e7a30a0cf45d49eb6142c15d1f710 - - -] 192.168.100.41 - - [17/Jun/2015 11:18:49] "GET /v2/images/ee63031b-0d91-4c02-93b2-6d2a60f4d155/file HTTP/1.1" 200 10440802542 64.891102


Cinder Volume:

cinder/volume.log-2015-06-17 11:17:45.020 23484 DEBUG glanceclient.common.http [req-71aa7f96-589a-43dc-b16a-21b2273d5e56 e998e2dabe8e4d1eb2a6b778b4422c01 2c8e7a30a0cf45d49eb6142c15d1f710 - - -] 
cinder/volume.log-HTTP/1.1 200 OK
cinder/volume.log-date: Wed, 17 Jun 2015 15:17:45 GMT
cinder/volume.log-content-length: 14826602496
cinder/volume.log-content-type: application/octet-stream
cinder/volume.log-x-openstack-request-id: req-4b1bc2d5-1247-4538-9d2f-3d2d724b3d92
cinder/volume.log-content-md5: 3fc567c135ba42deb7f6c780921b49f2
cinder/volume.log- log_http_response /usr/lib/python2.7/site-packages/glanceclient/common/http.py:167
cinder/volume.log:2015-06-17 11:19:23.667 23484 ERROR cinder.volume.flows.manager.create_volume [req-71aa7f96-589a-43dc-b16a-21b2273d5e56 e998e2dabe8e4d1eb2a6b778b4422c01 2c8e7a30a0cf45d49eb6142c15d1f710 - - -] Failed to copy image ee63031b-0d91-4c02-93b2-6d2a60f4d155 to volume: 3f19fb90-a0c5-456d-a944-9bacdcf6a2c9, error: [Errno 32] Corrupted image. Checksum was 262a16cc59a1c6e37ad7e0687597ddb8 expected 3fc567c135ba42deb7f6c780921b49f2
cinder/volume.log:2015-06-17 11:19:23.673 23484 DEBUG cinder.volume.flows.common [req-71aa7f96-589a-43dc-b16a-21b2273d5e56 e998e2dabe8e4d1eb2a6b778b4422c01 2c8e7a30a0cf45d49eb6142c15d1f710 - - -] Updating volume: 3f19fb90-a0c5-456d-a944-9bacdcf6a2c9 with {'status': 'error'} due to: ??? error_out_volume /usr/lib/python2.7/site-packages/cinder/volume/flows/common.py:87
cinder/volume.log:2015-06-17 11:19:23.748 23484 DEBUG cinder.openstack.common.periodic_task [-] Running periodic task VolumeManager._publish_service_capabilities run_periodic_tasks /usr/lib/python2.7/site-packages/cinder/openstack/common/periodic_task.py:178
cinder/volume.log:2015-06-17 11:19:23.748 23484 DEBUG cinder.manager [-] Notifying Schedulers of capabilities ... _publish_service_capabilities /usr/lib/python2.7/site-packages/cinder/manager.py:128
cinder/volume.log:2015-06-17 11:19:23.749 23484 ERROR cinder.volume.flows.manager.create_volume [req-71aa7f96-589a-43dc-b16a-21b2273d5e56 e998e2dabe8e4d1eb2a6b778b4422c01 2c8e7a30a0cf45d49eb6142c15d1f710 - - -] Volume 3f19fb90-a0c5-456d-a944-9bacdcf6a2c9: create failed
cinder/volume.log:2015-06-17 11:19:23.750 23484 ERROR oslo.messaging.rpc.dispatcher [req-71aa7f96-589a-43dc-b16a-21b2273d5e56 e998e2dabe8e4d1eb2a6b778b4422c01 2c8e7a30a0cf45d49eb6142c15d1f710 - - -] Exception during message handling: Failed to copy image to volume: [Errno 32] Corrupted image. Checksum was 262a16cc59a1c6e37ad7e0687597ddb8 expected 3fc567c135ba42deb7f6c780921b49f2
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 133, in _dispatch_and_reply
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher     incoming.message))
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 176, in _dispatch
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher     return self._do_dispatch(endpoint, method, ctxt, args)
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 122, in _do_dispatch
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher     result = getattr(endpoint, method)(ctxt, **new_args)
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/cinder/volume/manager.py", line 363, in create_volume
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher     _run_flow()
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/cinder/volume/manager.py", line 356, in _run_flow
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher     flow_engine.run()
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/taskflow/utils/lock_utils.py", line 51, in wrapper
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher     return f(*args, **kwargs)
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/taskflow/engines/action_engine/engine.py", line 118, in run
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher     self._run()
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/taskflow/engines/action_engine/engine.py", line 128, in _run
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher     self._revert(misc.Failure())
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/taskflow/engines/action_engine/engine.py", line 81, in _revert
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher     misc.Failure.reraise_if_any(failures.values())
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/taskflow/utils/misc.py", line 487, in reraise_if_any
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher     failures[0].reraise()
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/taskflow/utils/misc.py", line 494, in reraise
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher     six.reraise(*self._exc_info)
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/taskflow/engines/action_engine/executor.py", line 36, in _execute_task
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher     result = task.execute(**arguments)
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/cinder/volume/flows/manager/create_volume.py", line 594, in execute
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher     **volume_spec)
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/cinder/volume/flows/manager/create_volume.py", line 556, in _create_from_image
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher     image_id, image_location, image_service)
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/cinder/volume/flows/manager/create_volume.py", line 470, in _copy_image_to_volume
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher     raise exception.ImageCopyFailure(reason=ex)
cinder/volume.log:2015-06-17 11:19:23.750 23484 TRACE oslo.messaging.rpc.dispatcher ImageCopyFailure: Failed to copy image to volume: [Errno 32] Corrupted image. Checksum was 262a16cc59a1c6e37ad7e0687597ddb8 expected 3fc567c135ba42deb7f6c780921b49f2

Comment 6 Flavio Percoco 2015-06-28 20:52:36 UTC
I'd also recommend enabling DEBUG on glance-(api|registry), that should give us some extra info.

Comment 7 Flavio Percoco 2015-06-29 08:48:01 UTC
One more question that I forgot to add to my previous comments:

Is HAproxy (or any load balancer) configured in front of glance-api? Based on the logs, the failures are related to quite big images (15GB, 22GB), which leads me to think there could be timeouts or issues related to the time/bandwidth required to download such images.
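For context on this theory: if a load balancer fronts glance-api with server-side timeouts shorter than the time needed to stream a 15-22GB image, it would cut the connection mid-transfer, which matches the Broken pipe / varying-checksum pattern in the logs above. A purely illustrative haproxy.cfg fragment (the addresses, names, and timeout values are assumptions for the sketch, not taken from the customer's configuration):

```
# Illustrative haproxy.cfg fragment -- all values are assumptions.
listen glance_api
    bind 192.168.100.40:9292
    # Long-running image downloads need generous client/server timeouts;
    # a short 'timeout server' would sever large GET /v2/images/<id>/file
    # transfers mid-stream, producing [Errno 32] Broken pipe in glance-api
    # and a truncated (checksum-mismatched) image on the Cinder side.
    timeout client  30m
    timeout server  30m
    server ctrl1 192.168.100.41:9292 check
```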

Again, downloading the image manually (with glance image-download) could help in understanding where the issue originates.

Comment 8 Dave Maley 2015-06-30 17:05:06 UTC
GSS has requested the customer manually verify the image via glance image-download
and then compare it to what glance image-show returns from the DB

Also to note that according to the sosreports this is the same lab environment that's having networking issues (bug 1232502)

Comment 11 Dave Maley 2015-07-07 23:38:18 UTC
In addition to the above outputs provided by the customer today I've also passed along our recommendation to enable glance debug logging.  I've also pointed out the haproxy configuration issue from comment 9.

Comment 12 Sergey Gotliv 2015-07-08 05:45:52 UTC
I don't have access to the SOS report for some reason. We need to check the "image_conversion_dir" value in the cinder volume log to see whether this folder has enough space to download a 13+ GB image.

Comment 14 Dave Maley 2015-07-08 15:33:32 UTC
(In reply to Sergey Gotliv from comment #12)
> I don't have access to the SOS report for some reason. We need to check
> the "image_conversion_dir" value in the cinder volume log to see whether
> this folder has enough space to download a 13+ GB image.

sosreports should now be accessible.  I'm not seeing any image_conversion_dir references in the cinder logs.  Their cinder.conf has it set as the default value:

image_conversion_dir=$state_path/conversion
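A minimal pre-flight check of the kind Sergey describes could look like the sketch below. The directory path (the default, with $state_path expanded to /var/lib/cinder) and the 13 GiB threshold are illustrative assumptions; the fallback to /tmp is only there so the sketch stays runnable outside an OpenStack node:

```shell
#!/bin/sh
# Sketch: warn when the image conversion directory lacks the free space
# needed to stage a large image download. Path and threshold are
# illustrative, not taken from the customer's environment.
CONV_DIR="${CONV_DIR:-/var/lib/cinder/conversion}"
NEEDED_KB=$((13 * 1024 * 1024))   # 13 GiB, expressed in KiB

# Fall back to /tmp if the Cinder directory does not exist on this host,
# so the sketch can be run anywhere for demonstration purposes.
[ -d "$CONV_DIR" ] || CONV_DIR=/tmp

# POSIX df: column 4 of the data row is the available space in KiB.
free_kb=$(df -Pk "$CONV_DIR" | awk 'NR==2 {print $4}')
if [ "$free_kb" -lt "$NEEDED_KB" ]; then
    echo "WARNING: only ${free_kb} KiB free in $CONV_DIR"
else
    echo "OK: ${free_kb} KiB free in $CONV_DIR"
fi
```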

Comment 16 Sergey Gotliv 2015-07-13 11:23:25 UTC
Dave,

Can we close this bug for 5.0? We can open another bug for 7.0 to make sure that the customer gets a relevant, meaningful error message when it happens.

Comment 17 Dave Maley 2015-07-14 19:45:04 UTC
(In reply to Sergey Gotliv from comment #16)
> Dave,
> 
> Can we close this bug for 5.0? We can open another bug for 7.0 to make sure
> that the customer gets a relevant, meaningful error message when it happens.

I'm fine closing this bug once we've provided a recommendation on the space requirements for the conversion dir.  I know we can't give a definitive answer here; however, I'd like us to at least provide some general guidance if possible.

Agreed on opening a new osp7 bug for improving the error message.

Comment 18 Sergey Gotliv 2015-07-19 00:12:34 UTC
(In reply to Dave Maley from comment #17)
> (In reply to Sergey Gotliv from comment #16)
> > Dave,
> > 
> > Can we close this bug for 5.0? We can open another bug for 7.0 to make sure
> > that the customer gets a relevant, meaningful error message when it happens.
> 
> I'm fine closing this bug once we've provided a recommendation on the space
> requirements for the conversion dir.  I know we can't give a definitive
> answer here; however, I'd like us to at least provide some general guidance
> if possible.

Done. Please, see release notes.

> 
> Agreed on opening a new osp7 bug for improving the error message.

It seems that in Kilo the error message indicates that there is a space issue.

I am closing this case based on the conversation we had with the customer and after providing release notes explaining the issue.