Description of problem:
el6 build from CentOS 6.2, tested using the instructions on the wiki for a Fedora JEOS.

Version-Release number of selected component (if applicable):
2012.1-5

How reproducible:
Build an el6 JEOS image using oz-install, convert it to qcow2 and import it into glance.

Steps to Reproduce:
1. sudo oz-install el6.tdl
2. convert -c -p -f raw -O qcow2 /var/lib/libvirt/images/el6.dsk /path/to/el6.qcow2
3. glance add name=el6-jeos is_public=true disk_format=qcow2 container_format=ovf < /path/to/el6.qcow2

Actual results:
compute.log
...
2012-06-15 13:10:20 TRACE nova.rpc.amqp File "/usr/lib/python2.7/site-packages/nova/virt/images.py", line 99, in fetch_to_raw
2012-06-15 13:10:20 TRACE nova.rpc.amqp path_tmp, staged)
2012-06-15 13:10:20 TRACE nova.rpc.amqp File "/usr/lib/python2.7/site-packages/nova/utils.py", line 243, in execute
2012-06-15 13:10:20 TRACE nova.rpc.amqp cmd=' '.join(cmd))
2012-06-15 13:10:20 TRACE nova.rpc.amqp ProcessExecutionError: Unexpected error while running command.
2012-06-15 13:10:20 TRACE nova.rpc.amqp Command: qemu-img convert -O raw /var/lib/nova/instances/_base/2ea04dba5f407cbd19093a0bfe27c04e1a561cb8.part /var/lib/nova/instances/_base/2ea04dba5f407cbd19093a0bfe27c04e1a561cb8.converted
2012-06-15 13:10:20 TRACE nova.rpc.amqp Exit code: 1
2012-06-15 13:10:20 TRACE nova.rpc.amqp Stdout: ''
2012-06-15 13:10:20 TRACE nova.rpc.amqp Stderr: 'qemu-img: error while reading sector 12734464: Input/output error\n'
...

Expected results:
Started instance.
Are you really using "convert" to convert the image? We successfully follow steps very similar to the ones you outline, but use:

qemu-img convert -c -O qcow2 /var/lib/libvirt/images/el6.dsk /path/to/el6.qcow2

to convert the image.
So maybe this should be moved to the qemu package, as there seem to be other reports (though in the other direction): http://lists.gnu.org/archive/html/qemu-devel/2011-09/msg03847.html

An I/O error is a bit strange though. Are you sure there is no issue with the storage media etc.? Also, I'd run qemu-img check on the .dsk image and also try to boot it, to confirm the image itself is OK.
There are no immediately obvious storage issues; this is a new system with a fresh F17 install.

--- qemu-img check (el6.dsk)
qemu-img: This image format does not support checks

--- qemu-img check (el6-1.qcow2, created with -c -p -f raw -O qcow2)
No errors were found on the image.

--- qemu-img check (el6-2.qcow2, created with -c -O qcow2)
No errors were found on the image.

--- test launch of el6-1.qcow2
glance add name=el6-1 is_public=true disk_format=qcow2 container_format=ovf < el6-1.qcow2
Uploading image 'el6-1'
===========================================[100%] 64.6M/s, ETA 0h 0m 0s
Added new image with ID: aabc8b37-d230-46da-9945-cbbd1643b7b3

compute.log
...
2012-06-15 15:35:42 TRACE nova.rpc.amqp cmd=' '.join(cmd))
2012-06-15 15:35:42 TRACE nova.rpc.amqp ProcessExecutionError: Unexpected error while running command.
2012-06-15 15:35:42 TRACE nova.rpc.amqp Command: qemu-img convert -O raw /var/lib/nova/instances/_base/ea3be1395477e467b2925ad19d0957fa391b0cbc.part /var/lib/nova/instances/_base/ea3be1395477e467b2925ad19d0957fa391b0cbc.converted
2012-06-15 15:35:42 TRACE nova.rpc.amqp Exit code: 1
2012-06-15 15:35:42 TRACE nova.rpc.amqp Stdout: ''
2012-06-15 15:35:42 TRACE nova.rpc.amqp Stderr: 'qemu-img: error while reading sector 8032256: Input/output error\n'
...

--- test launch of el6-2.qcow2
glance add name=el6-2 is_public=true disk_format=qcow2 container_format=ovf < el6-2.qcow2
Uploading image 'el6-2'
===========================================[100%] 70.1M/s, ETA 0h 0m 0s
Added new image with ID: 30bba89a-bf37-4650-9e41-85957358db15

compute.log
...
2012-06-15 15:37:33 TRACE nova.rpc.amqp File "/usr/lib/python2.7/site-packages/nova/utils.py", line 243, in execute
2012-06-15 15:37:33 TRACE nova.rpc.amqp cmd=' '.join(cmd))
2012-06-15 15:37:33 TRACE nova.rpc.amqp ProcessExecutionError: Unexpected error while running command.
2012-06-15 15:37:33 TRACE nova.rpc.amqp Command: qemu-img convert -O raw /var/lib/nova/instances/_base/70b6322b985f898d824373e90c11efadf55f87dd.part /var/lib/nova/instances/_base/70b6322b985f898d824373e90c11efadf55f87dd.converted
2012-06-15 15:37:33 TRACE nova.rpc.amqp Exit code: 1
2012-06-15 15:37:33 TRACE nova.rpc.amqp Stdout: ''
2012-06-15 15:37:33 TRACE nova.rpc.amqp Stderr: 'qemu-img: error while reading sector 499712: Input/output error\n'
...

--- virt-install
sudo cp el*.qcow2 /var/lib/libvirt/images

virt-install -n "el6-1" -r 1024 --arch=x86_64 --vcpus=1 --os-type=linux --os-variant=rhel6 --disk path=/var/lib/libvirt/images/el6-1.qcow2,sparse=true,cache=none,device=disk,bus=virtio,format=qcow2 --accelerate --vnc --import

Starting install...
Creating domain... | 0 B 00:00
WARNING Unable to connect to graphical console: virt-viewer not installed. Please install the 'virt-viewer' package.
Domain creation completed. You can restart your domain by running:
virsh --connect qemu:///system start el6-1

Confirmed the image boots with no issues at all; console login via a virt-manager-attached console completed without issue.

virt-install -n "el6-2" -r 1024 --arch=x86_64 --vcpus=1 --os-type=linux --os-variant=rhel6 --disk path=/var/lib/libvirt/images/el6-2.qcow2,sparse=true,cache=none,device=disk,bus=virtio,format=qcow2 --accelerate --vnc --import

Starting install...
Creating domain... | 0 B 00:00
WARNING Unable to connect to graphical console: virt-viewer not installed. Please install the 'virt-viewer' package.
Domain creation completed. You can restart your domain by running:
virsh --connect qemu:///system start el6-2

Same result: the instance boots and I can log in via a virt-manager-attached console.

--- qemu-img manual repeat of the failed command
david@David-Samsung ~$ qemu-img convert -O raw ./el6-1.qcow2 ./el6-1.raw
david@David-Samsung ~$ $?
zsh: command not found: 0
(i.e. the command exited with status 0.)

Note: SELinux is running Permissive for this entire test, so that's not the cause.
So qemu-img is OK with the original image. So somewhere in the glance -> nova transfer the image is corrupted?

Note glance stores its images by default in /var/lib/glance/images, so you can compare those to see if glance was at fault. Comparing the image transferred to nova will be a bit trickier. I'd modify the fetch_to_raw() function in nova/virt/images.py to log the length and maybe the md5sum of the ".part" file, to see if it matches.

BTW, is the nova instances dir on a standard ext4 local file system?
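For reference, the suggested instrumentation needn't shell out to md5sum; a minimal standalone sketch (the helper name and its placement inside fetch_to_raw() are illustrative, not nova's actual API):

```python
import hashlib


def md5_of_file(path, chunk_size=64 * 1024):
    """Hash a file in fixed-size chunks so large images never sit in memory."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            md5.update(data)
    return md5.hexdigest()

# Inside fetch_to_raw(), after the download completes, one could log e.g.:
#   LOG.info("fetch_to_raw %s %s" % (md5_of_file(path_tmp), path_tmp))
```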
The file system is ext4 under LVM:

/dev/mapper/vg_davidsamsung-lv_root on / type ext4 (rw,relatime,seclabel,data=ordered) [_Fedora-16-x86_6]

I'll have a go at modifying images.py per your suggestions above.
I am getting this too; if I can do something to help, let me know. I can't do any work until this is fixed.
This seems to me to be the transfer from glance to nova.

Note: I can also report that the image in /var/lib/glance/images/ does work (like David shows).

I added the following 2 lines to fetch_to_raw() in /usr/lib/python2.7/site-packages/nova/virt/images.py:63:

out, err = utils.execute('md5sum', path_tmp)
LOG.info("fetch_to_raw %s" % (out))

When it works, the md5sum from this log matches the one in glance.

So a working one:
-----------------
$ glance add name=f16-jeos is_public=true disk_format=qcow2 container_format=ovf < /var/lib/libvirt/images/F16-x86_64-cfntools-jeos.qcow2

# check md5sum
md5sum /var/lib/glance/images/972d54e0-ddc7-4649-b939-792d0f78e6b1
16d476c6a693eda1d56133ffa18f4f91  /var/lib/glance/images/972d54e0-ddc7-4649-b939-792d0f78e6b1

nova boot --flavor m1.small --image 972d54e0-ddc7-4649-b939-792d0f78e6b1 freddy

# debug log message
2012-06-18 11:38:58 INFO nova.virt.images [req-0c219114-aadd-4ffa-ba57-101369a721d3 0977cd825fcd487ca2a9293e23835e5f ac5e752755d84ca3b66734fdd4188d37] fetch_to_raw 16d476c6a693eda1d56133ffa18f4f91  /var/lib/nova/instances/_base/b8c3b9799fc761ebd6ecdd3d7956532301c0d842.part

So both are 16d476c6a693eda1d56133ffa18f4f91.

A broken one (different base image):
-------------
nova image-list
+--------------------------------------+---------------------+--------+--------+
| ID                                   | Name                | Status | Server |
+--------------------------------------+---------------------+--------+--------+
| 0b364a2e-64c7-4b97-ad6d-1f8c06380af3 | F16-x86_64-cfntools | ACTIVE |        |
+--------------------------------------+---------------------+--------+--------+

md5sum /var/lib/glance/images/0b364a2e-64c7-4b97-ad6d-1f8c06380af3
45684826559792e1a3e25e0990cd3fd6  /var/lib/glance/images/0b364a2e-64c7-4b97-ad6d-1f8c06380af3

nova boot --flavor m1.small --image 0b364a2e-64c7-4b97-ad6d-1f8c06380af3 freddy3

2012-06-18 12:40:17 INFO nova.virt.images [req-ad856d45-7429-4c04-a5f4-9d0b70f147ab 0977cd825fcd487ca2a9293e23835e5f ac5e752755d84ca3b66734fdd4188d37] fetch_to_raw ee98c793baaf340eaef8a05e8ddc1137  /var/lib/nova/instances/_base/2dfbdba2c878b4d800f762b40e9d9ea21c128841.part

45684826559792e1a3e25e0990cd3fd6 != ee98c793baaf340eaef8a05e8ddc1137
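The glance-store-vs-nova-cache comparison above can be wrapped in a small helper; the paths in the example call are illustrative:

```shell
#!/bin/sh
# Compare the md5 of the image glance stores against the copy nova fetched.
md5cmp() {
    a=$(md5sum "$1" | awk '{print $1}')
    b=$(md5sum "$2" | awk '{print $1}')
    if [ "$a" = "$b" ]; then
        echo "match: $a"
    else
        echo "MISMATCH: $a != $b"
    fi
}

# e.g. (image id and cache file name are placeholders):
# md5cmp /var/lib/glance/images/<image-id> /var/lib/nova/instances/_base/<sha1>.part
```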
Thanks Angus, I came to do this this morning and you've already done it! I am applying your noted code updates and running through the tests with my el6-1 and el6-2 images today; I will update these tickets with the results.
Hi guys, could you also print the size of 'path_tmp' too, so we can see whether it might be a truncation issue rather than a corruption one?

Angus, was there any significant difference between the two images you tried, or are you saying it happens only sometimes?

As a wild guess, one setting that might change things is setting 'libvirt_nonblocking = False' in /etc/nova/nova.conf.

As a temp workaround you could set 'force_raw_images = False' in /etc/nova/nova.conf.
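For clarity, the two settings mentioned would both go in /etc/nova/nova.conf. Note that force_raw_images = False skips the 'qemu-img convert -O raw' step entirely, so it sidesteps rather than fixes the problem:

```ini
# /etc/nova/nova.conf -- workarounds suggested above, not a fix
libvirt_nonblocking = False
force_raw_images = False
```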
(In reply to comment #9)
> Hi guys, could you also print the size of 'path_tmp' too, so we can see
> whether it might be a truncation issue rather than a corruption one?

I think it is (truncation), but I'll make sure.

> Angus, was there any significant difference between the two images you tried,
> or are you saying it happens only sometimes?

No real difference, I was just re-building the same JEOS. It mostly fails; occasionally it works.

> As a wild guess, one setting that might change things is setting
> 'libvirt_nonblocking = False' in /etc/nova/nova.conf.
>
> As a temp workaround you could set 'force_raw_images = False' in
> /etc/nova/nova.conf.

Cool, I'll give those a go. Note: I am using an SSD, tho' this has been working for ages.
So I noticed a couple of unusual rpm issues:
- qpid (got updated and the daemon was mistakenly obsoleted)
- I was using the glance from http://repos.fedorapeople.org/repos/apevec/openstack-preview/fedora-$releasever/noarch/

So I re-installed all related openstack and qpid rpms:
- openstack from the updates-testing repo
- qpidd from koji.fedoraproject.org/koji/buildinfo?buildID=326079

Now I don't get this bug - so I am happy :-)
So Angus, you effectively downgraded glance from openstack-glance-2012.1-8 to openstack-glance-2012.1-4. David, I presume you're using openstack-glance-2012.1-4.fc17?
Hi Pádraig,

Yes I am.
--
openstack-glance-2012.1-4.fc17.noarch
openstack-nova-2012.1-5.fc17.noarch
openstack-keystone-2012.1-1.fc17.noarch
openstack-dashboard-2012.1-3.fc17.noarch
openstack-nova-doc-2012.1-5.fc17.noarch
openstack-utils-2012.1-1.fc17.noarch
openstack-quantum-2012.1-2.fc17.noarch
openstack-glance-doc-2012.1-4.fc17.noarch
--

Apologies for not getting around to retesting; work commitments have removed any time I had to work on this.

Note: I found the following this morning when attempting to update:
---
qpid-cpp-client = 0.14-1.fc17.2 is needed by (installed) qpid-cpp-server-0.14-1.fc17.2.x86_64
Please report this error in https://bugzilla.redhat.com/enter_bug.cgi?product=Fedora&version=rawhide&component=yum
** Found 1 pre-existing rpmdb problem(s), 'yum check' output follows:
hfsplus-tools-540.1.linux3-1.fc17.x86_64 has installed conflicts hfsplusutils: hfsplusutils-1.0.4-16.fc17.x86_64
Your transaction was saved, rerun it with:
---
Unsure if this is related.
qpid-cpp-server is borked in the repos at present. You can get the latest direct from Koji or wait "a short while" for it to filter out. I don't think that should impact this.

So both openstack-glance-2012.1-4.fc17 and openstack-glance-2012.1-8.fc17 are impacted. Angus' downgrade from one to the other must only have changed timings, masking the issue. Nice, a heisenbug :(

I'll reassign to glance since it seems the more likely source. Eoghan, any hints on why glance might intermittently provide partial images?

I must look at enhancing nova to checksum the image as it streams it to cache.
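The streaming-checksum enhancement mentioned could look roughly like this; a sketch under my own naming, not nova's actual code, where the expected checksum would come from the glance image metadata:

```python
import hashlib


def fetch_with_checksum(src, dst_path, expected_md5=None, chunk_size=64 * 1024):
    """Copy a file-like stream to dst_path, hashing each chunk as it lands.

    If expected_md5 is given and doesn't match, raise so the caller can
    retry the fetch instead of caching a truncated image.
    """
    md5 = hashlib.md5()
    with open(dst_path, "wb") as out:
        while True:
            data = src.read(chunk_size)
            if not data:
                break
            md5.update(data)
            out.write(data)
    actual = md5.hexdigest()
    if expected_md5 is not None and actual != expected_md5:
        raise IOError("checksum mismatch: %s != %s" % (actual, expected_md5))
    return actual
```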
Setting needinfo flag for Eoghan: > Eoghan any hints on why glance might intermittently provide partial images?
Apologies for the late response, I didn't see the request for info.

Now, I haven't seen a case previously where glance truncated images during download. Can Angus or David indicate whether truncation, as opposed to corruption, was actually confirmed? (This was requested by Pádraig above, but it's unclear to me from the comment trail whether it was fully confirmed or just assumed.)

One quick way of checking would be to add a few additional lines of debug to nova/virt/images.py near the previously added extra logging, something like:

glance_path = os.path.join('/var/lib/glance/images', image_href.split('/')[-1])
out, err = utils.execute('cmp', path_tmp, glance_path)
LOG.info("comparing fetched image %s to master %s: %s" % (path_tmp, glance_path, out))

Also, to aid in reproducing this issue in isolation, can you make the problematic image (/path/to/el6.qcow2) available somewhere for me to test? Or even just give an indication of its size.

Finally, can you give an indication of how reproducible this is in your environment against openstack-glance-2012.1-8, i.e. what % of download attempts fail?
Whilst I can't say for sure it was the cause, once the qpidd update issues cleared I am not seeing this issue occur again.
--
openstack-glance-2012.1-4.fc17.noarch
openstack-nova-2012.1-10.fc17.noarch
openstack-keystone-2012.1-5.fc17.noarch
openstack-dashboard-2012.1-3.fc17.noarch
openstack-nova-doc-2012.1-10.fc17.noarch
openstack-utils-2012.1-1.fc17.noarch
openstack-quantum-2012.1-2.fc17.noarch
openstack-glance-doc-2012.1-4.fc17.noarch
---
Unsure how to proceed on this one; in short, the issue has "vanished", though I don't like not knowing why.
So in an attempt to reproduce this, I've taken a ~200MB qcow2 image and fetched and converted it literally 1000s of times via the same codepath as used in nova.virt.images.fetch(). I've tested against both:

openstack-glance-2012.1-4.fc17.noarch
and:
openstack-glance-2012.1-8.fc17.noarch

Not a single case of corruption/truncation occurred, nor did 'qemu-img convert -O raw' ever fail.

Now I could submit a patch to nova that checks the downloaded image size/checksum against the image metadata, retrying the fetch in the case of a mismatch. However, I'm reluctant to propose such a patch without a reproducible failure.

I'm wondering if there's something simpler going on here. For example, could there have been transient disk-space starvation under /var/lib/nova/instances?

David, have you in any case retained the problematic image file that causes 'qemu-img convert -O raw' to fail? For example:

/var/lib/nova/instances/_base/2ea04dba5f407cbd19093a0bfe27c04e1a561cb8.part

If so, this might provide some forensic clues, for example the extent of the corruption/truncation.

Also can you check the version of qemu-img being used, just so we're definitely on the same page there:

$ rpm -qa | grep qemu-img
qemu-img-1.0-17.fc17.x86_64
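For anyone else trying to reproduce this, the repeated fetch-and-verify loop can be scripted along these lines. Here a local cp stands in for the glance download (the real test above exercised nova.virt.images.fetch(), and a qemu-img convert on each copy could be added inside the loop):

```shell
#!/bin/sh
# Repeatedly "fetch" an image and verify its checksum never drifts.
stress_fetch() {
    src=$1
    runs=$2
    ref=$(md5sum "$src" | awk '{print $1}')
    i=1
    while [ "$i" -le "$runs" ]; do
        cp "$src" "$src.part"                 # stand-in for the glance fetch
        got=$(md5sum "$src.part" | awk '{print $1}')
        rm -f "$src.part"
        if [ "$got" != "$ref" ]; then
            echo "mismatch on run $i: $got != $ref"
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $runs runs matched"
}
```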
Note that in order to have retained the problematic .part image file, you would have needed to suppress the preceding utils.remove_path_on_error() in nova/virt/images.py.
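For context, remove_path_on_error() wraps the fetch so the partial file is deleted whenever the block raises, which is why the .part file is gone after a failure. A simplified equivalent (not nova's exact code):

```python
import contextlib
import os


@contextlib.contextmanager
def remove_path_on_error(path):
    """Delete path if the wrapped block raises, then re-raise."""
    try:
        yield
    except Exception:
        if os.path.exists(path):
            os.remove(path)
        raise
```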
David - since this issue is proving very difficult to reproduce, we need to forensically go through the packages updated at and around the times you saw the problem appear and disappear. Can you attach your recent yum history:

$ for h in `sudo yum history list | grep 2012-06 | cut -f1 -d'|' | sort -n` ; do sudo yum history info $h ; done > /tmp/yum.history_2012-06
$ sudo grep Jun /var/log/yum.log > /tmp/yum.log_2012-06

Also, please give an indication of the precise dates you saw the problem first appear and then disappear.

Thanks,
Eoghan
Created attachment 596218 [details] Angus Salkeld's yum.history
Created attachment 596220 [details] yum log
Created attachment 596222 [details]
Yum history

I raised this BZ the same day I had the issue occur; unfortunately I do not have the invalid .part image. Date of issue: 15/06/2012.
Created attachment 596223 [details]
yum log
David, Angus,

I'm minded to close this issue, as it does not appear reproducible against the current versions of the packages, and a forensic trawl through the yum logs to isolate the package that caused the issue is looking a bit intractable. I don't think it's a productive use of time to chase the issue for purely historical interest. What is your feeling on this?

Regards,
Eoghan
Eoghan, I am of the same opinion. Cheers David
Closing with works-for-me, as discussed above. Cheers, Eoghan