Bug 832449 - Failiure to start jeos build once converted to qcow2
Summary: Failiure to start jeos build once converted to qcow2
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Fedora
Classification: Fedora
Component: openstack-glance
Version: 17
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Eoghan Glynn
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2012-06-15 12:28 UTC by David Busby
Modified: 2012-07-09 20:18 UTC (History)
19 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-07-09 20:18:16 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
Angus Salkeld's yum.history (56.76 KB, text/plain)
2012-07-04 13:03 UTC, Eoghan Glynn
no flags Details
yum log (42.51 KB, application/octet-stream)
2012-07-04 13:17 UTC, Angus Salkeld
no flags Details
YUm history (66.68 KB, text/plain)
2012-07-04 13:24 UTC, David Busby
no flags Details
yum log (46.26 KB, text/plain)
2012-07-04 13:25 UTC, David Busby
no flags Details

Description David Busby 2012-06-15 12:28:17 UTC
Description of problem:

el6 build from CentOS6.2 test using instructions on wiki for fedora jeos.


Version-Release number of selected component (if applicable):

2012.1-5

How reproducible:

Build el6 jeos image using oz-install, convert to qcow2 and import into glance

Steps to Reproduce:

1. sudo oz-install el6.tdl
2. convert -c -p -f raw -O qcow2  /var/lib/libvirt/images/el6.dsk /path/to/el6.qcow2  
3. glance add name=el6-jeos is_public=true disk_format=qcow2 container_format=ovf < /path/to/el6.qcow2

  
Actual results:


compute.log
...
2012-06-15 13:10:20 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/site-packages/nova/virt/images.py", line 99, in fetch_to_raw
2012-06-15 13:10:20 TRACE nova.rpc.amqp     path_tmp, staged)
2012-06-15 13:10:20 TRACE nova.rpc.amqp   File "/usr/lib/python2.7/site-packages/nova/utils.py", line 243, in execute
2012-06-15 13:10:20 TRACE nova.rpc.amqp     cmd=' '.join(cmd))
2012-06-15 13:10:20 TRACE nova.rpc.amqp ProcessExecutionError: Unexpected error while running command.
2012-06-15 13:10:20 TRACE nova.rpc.amqp Command: qemu-img convert -O raw /var/lib/nova/instances/_base/2ea04dba5f407cbd19093a0bfe27c04e1a561cb8.part /var/lib/nova/instances/_base/2ea04dba5f407cbd19093a0bfe27c04e1a561cb8.converted
2012-06-15 13:10:20 TRACE nova.rpc.amqp Exit code: 1
2012-06-15 13:10:20 TRACE nova.rpc.amqp Stdout: ''
2012-06-15 13:10:20 TRACE nova.rpc.amqp Stderr: 'qemu-img: error while reading sector 12734464: Input/output error\n'
...

Expected results:

Started instance.

Comment 1 Jan van Eldik 2012-06-15 12:47:58 UTC
Are you really using "convert" to convert the image?

We successfully follow steps very similar to the ones you outline, 
but use 

  qemu-img convert -c -O qcow2 /var/lib/libvirt/images/el6.dsk
/path/to/el6.qcow2

to convert the image.

Comment 2 Pádraig Brady 2012-06-15 13:18:07 UTC
So maybe this should be moved to the qemu package,
as there seem to be other reports (though this is the other direction):
http://lists.gnu.org/archive/html/qemu-devel/2011-09/msg03847.html

I/O error is a bit strange though. Are you sure there no issue with storage media etc? Also I'd qemu-info check the dsk image and also try to boot it to see if it's OK itself.

Comment 3 David Busby 2012-06-15 14:50:29 UTC
There's no immediately obvious storage issues this is a new system with a new f17 install


--- qemu-img check (el6.dsk)

qemu-img: This image format does not support checks

--- qemu-img check (-c -p -f raw -O qcow2 .. el6-1.qcow2)

No errors were found on the image.

--- qemu-img check (-c -O qcow2 .. el6-2.qcow2)

No errors were found on the image.

--- test launch of el6-1.qcow2

glance add name=el6-1 is_public=true disk_format=qcow2 container_format=ovf < el6-1.qcow2
Uploading image 'el6-1'
===========================================[100%] 64.6M/s, ETA  0h  0m  0s
Added new image with ID: aabc8b37-d230-46da-9945-cbbd1643b7b3


compute.log
...
2012-06-15 15:35:42 TRACE nova.rpc.amqp     cmd=' '.join(cmd))
2012-06-15 15:35:42 TRACE nova.rpc.amqp ProcessExecutionError: Unexpected error while running command.
2012-06-15 15:35:42 TRACE nova.rpc.amqp Command: qemu-img convert -O raw /var/lib/nova/instances/_base/ea3be1395477e467b2925ad19d0957fa391b0cbc.part /var/lib/nova/instances/_base/ea3be1395477e467b2925ad19d0957fa391b0cbc.converted
2012-06-15 15:35:42 TRACE nova.rpc.amqp Exit code: 1
2012-06-15 15:35:42 TRACE nova.rpc.amqp Stdout: ''
2012-06-15 15:35:42 TRACE nova.rpc.amqp Stderr: 'qemu-img: error while reading sector 8032256: Input/output error\n'
...


-- test launch of el6-2.qcow2

glance add name=el6-2 is_public=true disk_format=qcow2 container_format=ovf < el6-2.qcow2
Uploading image 'el6-2'
===========================================[100%] 70.1M/s, ETA  0h  0m  0s
Added new image with ID: 30bba89a-bf37-4650-9e41-85957358db15

ckages/nova/utils.py", line 243, in execute
2012-06-15 15:37:33 TRACE nova.rpc.amqp     cmd=' '.join(cmd))
2012-06-15 15:37:33 TRACE nova.rpc.amqp ProcessExecutionError: Unexpected error while running command.
2012-06-15 15:37:33 TRACE nova.rpc.amqp Command: qemu-img convert -O raw /var/lib/nova/instances/_base/70b6322b985f898d824373e90c11efadf55f87dd.part /var/lib/nova/instances/_base/70b6322b985f898d824373e90c11efadf55f87dd.converted
2012-06-15 15:37:33 TRACE nova.rpc.amqp Exit code: 1
2012-06-15 15:37:33 TRACE nova.rpc.amqp Stdout: ''
2012-06-15 15:37:33 TRACE nova.rpc.amqp Stderr: 'qemu-img: error while reading sector 499712: Input/output error\n'

--- virt-install of el6-1.qcow2

There's no immediately obvious storage issues this is a new system with a new f17 install


--- qemu-img check (el6.dsk)

qemu-img: This image format does not support checks

--- qemu-img check (-c -p -f raw -O qcow2 .. el6-1.qcow2)

No errors were found on the image.

--- qemu-img check (-c -O qcow2 .. el6-2.qcow2)

No errors were found on the image.

--- test launch of el6-1.qcow2

glance add name=el6-1 is_public=true disk_format=qcow2 container_format=ovf < el6-1.qcow2
Uploading image 'el6-1'
===========================================[100%] 64.6M/s, ETA  0h  0m  0s
Added new image with ID: aabc8b37-d230-46da-9945-cbbd1643b7b3


compute.log
...
2012-06-15 15:35:42 TRACE nova.rpc.amqp     cmd=' '.join(cmd))
2012-06-15 15:35:42 TRACE nova.rpc.amqp ProcessExecutionError: Unexpected error while running command.
2012-06-15 15:35:42 TRACE nova.rpc.amqp Command: qemu-img convert -O raw /var/lib/nova/instances/_base/ea3be1395477e467b2925ad19d0957fa391b0cbc.part /var/lib/nova/instances/_base/ea3be1395477e467b2925ad19d0957fa391b0cbc.converted
2012-06-15 15:35:42 TRACE nova.rpc.amqp Exit code: 1
2012-06-15 15:35:42 TRACE nova.rpc.amqp Stdout: ''
2012-06-15 15:35:42 TRACE nova.rpc.amqp Stderr: 'qemu-img: error while reading sector 8032256: Input/output error\n'
...


-- test launch of el6-2.qcow2

glance add name=el6-2 is_public=true disk_format=qcow2 container_format=ovf < el6-2.qcow2
Uploading image 'el6-2'
===========================================[100%] 70.1M/s, ETA  0h  0m  0s
Added new image with ID: 30bba89a-bf37-4650-9e41-85957358db15

ckages/nova/utils.py", line 243, in execute
2012-06-15 15:37:33 TRACE nova.rpc.amqp     cmd=' '.join(cmd))
2012-06-15 15:37:33 TRACE nova.rpc.amqp ProcessExecutionError: Unexpected error while running command.
2012-06-15 15:37:33 TRACE nova.rpc.amqp Command: qemu-img convert -O raw /var/lib/nova/instances/_base/70b6322b985f898d824373e90c11efadf55f87dd.part /var/lib/nova/instances/_base/70b6322b985f898d824373e90c11efadf55f87dd.converted
2012-06-15 15:37:33 TRACE nova.rpc.amqp Exit code: 1
2012-06-15 15:37:33 TRACE nova.rpc.amqp Stdout: ''
2012-06-15 15:37:33 TRACE nova.rpc.amqp Stderr: 'qemu-img: error while reading sector 499712: Input/output error\n'


--- virt-install

sudo cp el*.qcow2 /var/lib/libvirt/images


virt-install -n "el6-1" -r 1024 --arch=x86_64 --vcpus=1 --os-type=linux --os-variant=rhel6 --disk path=/var/lib/libvirt/images/el6-1.qcow2,sparse=true,cache=none,device=disk,bus=virtio,format=qcow2 --accelerate --vnc --import

Starting install...
Creating domain...                                 |    0 B     00:00     
WARNING  Unable to connect to graphical console: virt-viewer not installed. Please install the 'virt-viewer' package.
Domain creation completed. You can restart your domain by running:
  virsh --connect qemu:///system start el6-1


Confirmed image boots with no issues at all, console login via virt-manager attached console completed without issue.


 virt-install -n "el6-2" -r 1024 --arch=x86_64 --vcpus=1 --os-type=linux --os-variant=rhel6 --disk path=/var/lib/libvirt/images/el6-2.qcow2,sparse=true,cache=none,device=disk,bus=virtio,format=qcow2 --accelerate --vnc --import

Starting install...
Creating domain...                                 |    0 B     00:00     
WARNING  Unable to connect to graphical console: virt-viewer not installed. Please install the 'virt-viewer' package.
Domain creation completed. You can restart your domain by running:
  virsh --connect qemu:///system start el6-2

Same result instance boots and can login via virt-manager attached console.

--- qemu-img manual repeat of failed command

david@David-Samsung ~$ qemu-img convert -O raw ./el6-1.qcow2 ./el6-1.raw 
david@David-Samsung ~$ $?                                                

zsh: command not found: 0


Note: selinux is running as Permissive for this entire test, as such it's not that.

Comment 4 Pádraig Brady 2012-06-15 15:13:39 UTC
So qemu-img is OK with the original image.

So somewhere in the -> glance -> nova transfer the image is corrupted?
Note glance stores its images by default in /var/lib/glance/images,
so you can compare those to see if glance was at fault.

Comparing the image transfered to nova will be a bit trickier.
I'd modify the fetch_to_raw() function in nova/virt/images.py
to log the length and maybe the md5sum of the ".part" file,
to see if it matches.

BTW is the nova instances dir on a standard ext4 local file system?

Comment 5 David Busby 2012-06-15 15:19:35 UTC
File system is ext4 under lvm

/dev/mapper/vg_davidsamsung-lv_root on / type ext4 (rw,relatime,seclabel,data=ordered) [_Fedora-16-x86_6]

I'll have a go at modding the images.py per your suggestions above.

Comment 6 Angus Salkeld 2012-06-17 12:46:11 UTC
I am getting this too, if I can do something to help let me know - I can't do
any work till this is fixed.

Comment 7 Angus Salkeld 2012-06-18 02:42:54 UTC
This seems to me to be the transfer from glance to nova.

Note: I also can report the image in /var/lib/glance/images/ does work
(Like David shows)

I add the following 2 lines to fetch_to_raw() in /usr/lib/python2.7/site-packages/nova/virt/images.py:63

    out, err = utils.execute('md5sum', path_tmp)
    LOG.info("fetch_to_raw %s" % (out))

When it works the md5sum from this log matches the one in glance.

So a working one:
-----------------

$ glance add name=f16-jeos is_public=true disk_format=qcow2 container_format=ovf < /var/lib/libvirt/images/F16-x86_64-cfntools-jeos.qcow2

# check
md5sum /var/lib/glance/images/972d54e0-ddc7-4649-b939-792d0f78e6b1 
16d476c6a693eda1d56133ffa18f4f91  /var/lib/glance/images/972d54e0-ddc7-4649-b939-792d0f78e6b1

nova boot --flavor m1.small --image 972d54e0-ddc7-4649-b939-792d0f78e6b1 freddy

# debug log message
2012-06-18 11:38:58 INFO nova.virt.images [req-0c219114-aadd-4ffa-ba57-101369a721d3 0977cd825fcd487ca2a9293e23835e5f ac5e752755d84ca3b66734fdd4188d37] fetch_to_raw 16d476c6a693eda1d56133ffa18f4f91  /var/lib/nova/instances/_base/b8c3b9799fc761ebd6ecdd3d7956532301c0d842.part

So both 16d476c6a693eda1d56133ffa18f4f91

A broken one: (different base image)
-------------

nova image-list
+--------------------------------------+---------------------+--------+--------+
|                  ID                  |         Name        | Status | Server |
+--------------------------------------+---------------------+--------+--------+
| 0b364a2e-64c7-4b97-ad6d-1f8c06380af3 | F16-x86_64-cfntools | ACTIVE |        |
+--------------------------------------+---------------------+--------+--------+

md5sum /var/lib/glance/images/0b364a2e-64c7-4b97-ad6d-1f8c06380af3 
45684826559792e1a3e25e0990cd3fd6  /var/lib/glance/images/0b364a2e-64c7-4b97-ad6d-1f8c06380af3

nova boot --flavor m1.small --image 0b364a2e-64c7-4b97-ad6d-1f8c06380af3 freddy3

2012-06-18 12:40:17 INFO nova.virt.images [req-ad856d45-7429-4c04-a5f4-9d0b70f147ab 0977cd825fcd487ca2a9293e23835e5f ac5e752755d84ca3b66734fdd4188d37] fetch_to_raw ee98c793baaf340eaef8a05e8ddc1137  /var/lib/nova/instances/_base/2dfbdba2c878b4d800f762b40e9d9ea21c128841.part

45684826559792e1a3e25e0990cd3fd6 != ee98c793baaf340eaef8a05e8ddc1137

Comment 8 David Busby 2012-06-18 08:43:06 UTC
Thanks Agnus, came to do this, this morning and you've already done it!

I am applying your noted code updates and running through the tests with my el6-1 and el6-2 images today, will update these tickets with the results.

Comment 9 Pádraig Brady 2012-06-18 09:38:34 UTC
Hi guys, could you also print the size of the 'path_tmp' too, so ssee might it be a truncation issue rather than a corruption one?

Angus, was there any significant difference between the two images you tried,
or are you saying it happens only sometimes?

As a wild guess one setting that might change things is, setting 'libvirt_nonblocking = False' in /etc/nova/nova.conf

As a temp workaround you could set 'force_raw_images = False' in /etc/nova/nova.conf

Comment 10 Angus Salkeld 2012-06-18 23:04:48 UTC
(In reply to comment #9)
> Hi guys, could you also print the size of the 'path_tmp' too, so ssee might
> it be a truncation issue rather than a corruption one?

I think it is (truncation), but I'll make sure.

> 
> Angus, was there any significant difference between the two images you tried,
> or are you saying it happens only sometimes?

No real difference, I was just re-building the same jeos.
It mostly fails - occasionally it works

> 
> As a wild guess one setting that might change things is, setting
> 'libvirt_nonblocking = False' in /etc/nova/nova.conf
> 
> As a temp workaround you could set 'force_raw_images = False' in
> /etc/nova/nova.conf

Cool, I'll give those a go.

Note: I am using an ssd - tho' this has been working for ages.

Comment 11 Angus Salkeld 2012-06-19 04:19:37 UTC
So I noticed a couple of unusual rpm issues:
- qpid (got updated and the daemon was mistakenly obsoleted)
- I was using the glance from http://repos.fedorapeople.org/repos/apevec/openstack-preview/fedora-$releasever/noarch/

So I re-installed all related openstack and qpid rpms.
openstack from the updates-testing repo
qpidd from koji.fedoraproject.org/koji/buildinfo?buildID=326079

Now I don't get this bug - so I am happy :-)

Comment 12 Pádraig Brady 2012-06-19 08:38:37 UTC
So Angus you effectively downgraded glance from openstack-glance-2012.1-8 to openstack-glance-2012.1-4.

David I presume you're using openstack-glance-2012.1-4.fc17 ?

Comment 13 David Busby 2012-06-19 15:27:30 UTC
Hi Pádraig,

Yes I am.

--
openstack-glance-2012.1-4.fc17.noarch
openstack-nova-2012.1-5.fc17.noarch
openstack-keystone-2012.1-1.fc17.noarch
openstack-dashboard-2012.1-3.fc17.noarch
openstack-nova-doc-2012.1-5.fc17.noarch
openstack-utils-2012.1-1.fc17.noarch
openstack-quantum-2012.1-2.fc17.noarch
openstack-glance-doc-2012.1-4.fc17.noarch
---

Apologies for not getting around to retesting; work commitments have removed any time I have had to work on this.

Note: I have found the following this morning when attempting to update:

---
qpid-cpp-client = 0.14-1.fc17.2 is needed by (installed) qpid-cpp-server-0.14-1.fc17.2.x86_64
Please report this error in https://bugzilla.redhat.com/enter_bug.cgi?product=Fedora&version=rawhide&component=yum
** Found 1 pre-existing rpmdb problem(s), 'yum check' output follows:
hfsplus-tools-540.1.linux3-1.fc17.x86_64 has installed conflicts hfsplusutils: hfsplusutils-1.0.4-16.fc17.x86_64
Your transaction was saved, rerun it with:
---

unsure if this is related.

Comment 14 Pádraig Brady 2012-06-19 16:24:25 UTC
qpid-cpp-server is borked in the repos at present.
You can get the latest direct from Koji or wait "a short while" for it to filter out. I don't think that should impact this.

So both openstack-glance-2012.1-4.fc17 and openstack-glance-2012.1-8.fc17 are impacted. Angus' downgrade to one to the other must only have changed timings, masking the issue.

Nice, a heisenbug :(

I'll reassign to glance since it seems the more likely source.
Eoghan any hints on why glance might intermittently provide partial images?
I must look at enhancing nova to checksum the image as it streams it to cache.

Comment 15 Alan Pevec 2012-06-25 21:31:20 UTC
Setting needinfo flag for Eoghan:
> Eoghan any hints on why glance might intermittently provide partial images?

Comment 16 Eoghan Glynn 2012-07-02 17:34:08 UTC
Apologies for the late response, I didn't see the request for info.

Now, I haven't seen a case previously where glance truncated images during download.

Can Angus or David indicate whether truncation as opposed to corruption was actually confirmed (as requested by Pádraig above, but it's unclear to me from the comments trail whether this was fully confirmed or assumed).

One quick way of checking would be to add a few additional lines of debug to nova/virt/images.py near the previously added extra logging, something like:

    glance_path = os.path.join('/var/lib/glance/images',
                               image_href.split('/')[-1])
    out, err = utils.execute('cmp', path_tmp, glance_path)
    LOG.info("comparing fetched image %s to master %s: %s" %
             (path_tmp, glance_path, out))

Also, to aid in reproducing this issue in isolation, can you make the problematic image (/path/to/el6.qcow2) available somewhere for me to test? Or even just give an indication of its size.

Finally, can you an indication of how reproducible this is in your environment against openstack-glance-2012.1-8 ... i.e. what % of attempts to download will fail.

Comment 17 David Busby 2012-07-02 21:16:37 UTC
Whilst I can't say for sure it was the cause; once the qpidd update issues cleared I am not seeing this issue occur again.

--
openstack-glance-2012.1-4.fc17.noarch
openstack-nova-2012.1-10.fc17.noarch
openstack-keystone-2012.1-5.fc17.noarch
openstack-dashboard-2012.1-3.fc17.noarch
openstack-nova-doc-2012.1-10.fc17.noarch
openstack-utils-2012.1-1.fc17.noarch
openstack-quantum-2012.1-2.fc17.noarch
openstack-glance-doc-2012.1-4.fc17.noarch
---

unsure on how to proceed on this one, in short the issue has "vanished" though I don't like not knowing why.

Comment 18 Eoghan Glynn 2012-07-03 12:15:25 UTC
So in an attempt to reproduce this, I've taken a ~200Mb qcow2 image, fetched and converted it literally 1000s of times via the same codepath as used in nova.virt.images.fetch().

I've tested against both:

  openstack-glance-2012.1-4.fc17.noarch

and:

  openstack-glance-2012.1-8.fc17.noarch

Not a single case of corruption/truncation occurred, nor did 'qemu-img convert -O raw' ever fail.

Now I could submit a patch to nova that checks the downloaded image size/checksum against the image metadata, retrying the fetch in the case of a mismatch.

However, I'm reluctant to propose such a patch without a reproducible failure.

I'm wondering if there's something simpler going on here. For example, could there have been transient disk-space starvation under /var/lib/nova/instances?

David - have you in any case retained the problematic image file that cuases 'qemu-img convert -O raw' to fail? For example:

 /var/lib/nova/instances/_base/2ea04dba5f407cbd19093a0bfe27c04e1a561cb8.part

If so, this might provide some forensic clues, for example the extent of the corruption/truncation.

Also can you check the version qemu-img being used, just so we're definitely on the same page there:

  $ rpm -qa | grep qemu-img
  qemu-img-1.0-17.fc17.x86_64

Comment 19 Eoghan Glynn 2012-07-03 14:19:49 UTC
Note that in order to have retained the problematic image .part file, you would have needed to suppress the preceeding utils.remove_path_on_error() in nova/virt/images.py.

Comment 20 Eoghan Glynn 2012-07-04 12:56:33 UTC
David - since this issue is proving very difficult to reproduce, we need to forensically go through the packages updated at and around the times you saw the problem appear and disappear.

Can you attach your recent yum history:

$ for h in `sudo yum history list | grep 2012-06 | cut -f1 -d'|' | sort -n` ; do sudo yum history info $h ; done > /tmp/yum.history_2012-06 
$ sudo grep Jun /var/log/yum.log > /tmp/yum.log_2012-06

Also, please give an indication of the precise dates you saw the problem first appear then disappear.

Thanks,
Eoghan

Comment 21 Eoghan Glynn 2012-07-04 13:03:46 UTC
Created attachment 596218 [details]
Angus Salkeld's yum.history

Comment 22 Angus Salkeld 2012-07-04 13:17:35 UTC
Created attachment 596220 [details]
yum log

Comment 23 David Busby 2012-07-04 13:24:20 UTC
Created attachment 596222 [details]
YUm history

I raised this BZ the same day I had the issue occur, unfortunately I do not have the invalid .part image. date of issue 15/06/2012

Comment 24 David Busby 2012-07-04 13:25:28 UTC
Created attachment 596223 [details]
yum log

I raised this BZ the same day I had the issue occur, unfortunately I do not have the invalid .part image. date of issue 15/06/2012

Comment 25 Eoghan Glynn 2012-07-09 18:06:39 UTC
David, Angus,

I'm minded to close this issue as it does not appear reproducible against the current versions of the packages, and a forensic trawl through yum logs to isolate the package that caused the issue is looking a bit intractable. I don't think its a productive use of time to chase the issue for purely historical interest.

What is your feeling on this?

Regards,
Eoghan

Comment 26 David Busby 2012-07-09 18:29:00 UTC
Eoghan,

I am of the same opinion.

Cheers

David

Comment 27 Eoghan Glynn 2012-07-09 20:18:16 UTC
Closing with works-for-me, as discussed above.

Cheers,
Eoghan


Note You need to log in before you can comment on or make changes to this bug.