Description of problem:

I am handling a customer support case from Cisco with two findings:

1. Downloading a glance image with "rbd export" performs much faster than "glance image-download".
2. "glance image-download" from a local repository performs much faster than from the rbd backend.

Below is the customer's update:
~~~~~~~~~~~~~
See the following timing: "rbd export" started at the same time as "glance image-download", but "rbd export" finished in 12s while it took 1m15s for "glance image-download". During the "glance image-download", the progress meter shows frequent "pause and go", but "rbd export" does not.

If we stage the exact same image on the local file system of the glance-api node and disable the rbd data store option in glance-api.conf, "glance image-download" will fly through under 15s. So that proves there is no issue with the general node configuration.

root@csm-a-infra-001:~# date; time glance --os-image-url http://csx-a-glancectl-004:9292 image-download --file /dev/null --progress 68ac8018-f120-4dcd-8a7d-e1c66c4fe91a; date
Mon Sep 12 12:55:15 UTC 2016
[=============================>] 100%
real    1m15.741s
user    0m4.504s
sys     0m8.000s
Mon Sep 12 12:56:30 UTC 2016

[root@csx-a-glancectl-004 ~]# date; rm -f /tmp/wgs.qcow2; time rbd export csx-a-aio-glance-image-1/68ac8018-f120-4dcd-8a7d-e1c66c4fe91a /tmp/wgs.qcow2 --id csx-a-aio-glance-user; date
Mon Sep 12 12:55:16 UTC 2016
2016-09-12 12:55:17.065666 7fb7efc747c0 -1 asok(0x3659970) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/rbd-25748.asok': (2) No such file or directory
Exporting image: 100% complete...done.
real    0m12.252s
user    0m1.336s
sys     0m9.930s
Mon Sep 12 12:55:29 UTC 2016
~~~~~~~~~~~~~

Wondering if there is any configuration to improve the performance, or if this is expected and we need to explain it to the customer. The customer is aware of the following blueprint:
https://blueprints.launchpad.net/nova/+spec/direct-download-for-rbd

Version-Release number of selected component (if applicable):
RHEL OSP 5 on RHEL 7
python-glanceclient-0.13.1-1.el7ost.noarch
openstack-glance-2014.2.3-3.el7ost.noarch

How reproducible:
Always

Steps to Reproduce:
Scenario 1:
1. Configure glance with the rbd backend.
2. Compare "rbd export" and "glance image-download".

Scenario 2:
1. Time "glance image-download" with the rbd backend.
2. Reconfigure glance to use a local repository and download the same qcow2 image to compare.

Actual results:
"rbd export" is much faster than "glance image-download".
"glance image-download" from a local repository is much faster than from the rbd backend.

Expected results:
Is there any way to tune this for better performance?

Additional info:
Unfortunately, there's no configuration option to make this faster. The reason downloading directly from rbd is faster than downloading from glance-api is that in the latter case the image is effectively transferred twice: glance-api reads it from rbd and then sends it to the client over HTTP. It's not a direct stream. The way this is solved architecturally is by deploying the glance cache alongside the compute node. The work on direct image download from nova is still WIP (mostly in the discussion phase), and it depends on other work too.
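For reference, this is a minimal sketch of what enabling the Glance image cache typically looks like in glance-api.conf on a Juno-era (OSP5) deployment; treat the section names, path, and size below as illustrative and double-check them against the shipped glance-api.conf and glance-api-paste.ini:

~~~~~~~~~~~~~
[DEFAULT]
# local directory on the glance-api node used to cache image data
# after the first read from the rbd backend
image_cache_dir = /var/lib/glance/image-cache/
# upper bound, in bytes, used by the cache pruner (10 GB here)
image_cache_max_size = 10737418240

[paste_deploy]
# enable the caching/cache-management middleware in the API pipeline
flavor = keystone+cachemanagement
~~~~~~~~~~~~~

Keep in mind this only helps repeated downloads of the same image; the first download still streams through glance-api from rbd.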
Hello Flavio,

Thanks for the explanation. The customer is also interested in why glance image-download performs better from a local repository than from the rbd backend.

"If we stage the exact same image on the local file system of the glance-api node and disable the rbd data store option in glance-api.conf, "glance image-download" will fly through under 15s."
(In reply to James Biao from comment #4)
> Hello Flavio,
>
> Thanks for the explanation. The customer is also interested in why glance
> image-download performs better from a local repository than from the rbd
> backend.
>
> "If we stage the exact same image on the local file system of the glance-api
> node and disable the rbd data store option in glance-api.conf, "glance
> image-download" will fly through under 15s."

Hey James,

As I explained in my previous comment, reading over the network (from rbd, in this case) will make the download process slower than reading from the local filesystem. Is the network fast enough? Is there enough bandwidth? It's very likely they won't be able to match the performance results they are seeing with the locally staged copy. This has improved a bit in newer releases, btw.
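If it helps rule raw throughput in or out, here is a rough, hypothetical way to time the two read paths outside of glance, reusing the pool, image ID, and cephx user from the description (the local file path is just the copy produced by the earlier rbd export):

~~~~~~~~~~~~~
# raw rbd read, bypassing glance entirely (export to stdout and discard)
time rbd export csx-a-aio-glance-image-1/68ac8018-f120-4dcd-8a7d-e1c66c4fe91a - \
    --id csx-a-aio-glance-user > /dev/null

# local filesystem read on the glance-api node
# (drop the page cache first for a fair comparison: echo 3 > /proc/sys/vm/drop_caches)
time dd if=/tmp/wgs.qcow2 of=/dev/null bs=8M
~~~~~~~~~~~~~

If both of those are fast, the extra time is being spent in the glance-api streaming path rather than in the network or the storage itself.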
No, the network is not the bottleneck. If the glance-api rbd driver performed store-and-forward (staging the image to the local file system and then serving the cached local copy), it would be faster than the current streaming mechanism. It is not clear to me why the suboptimal streaming code path is considered acceptable.

(1) Store and forward (hypothetical process):
    ceph/rbd to the local file system: 15s
    glance-api sends the locally stored image file to the client: 15-17s

(2) Streaming (current mechanism, no caching to the local file system on the glance-api node):
    glance-api performs the rbd download and streams it to the client: 1m30s to 10m

It seems that the streaming code path for rbd in glance has very poor efficiency.

--weiguo
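P.S. The only store-side option we could find that looks related to how glance chunks its rbd I/O is the chunk size in glance-api.conf. Purely as a hypothetical tuning sketch: we have not verified whether this option affects the download (read) path on this release, the default varies between releases, and the section may be [DEFAULT] or [glance_store] depending on the packaged config:

~~~~~~~~~~~~~
[DEFAULT]
# size, in MB, of the chunks the rbd store works with;
# check the shipped default before changing it
rbd_store_chunk_size = 8
~~~~~~~~~~~~~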
So, I tried to reproduce this on an environment Avi set up for me. With a 1.5GB image, these are my results:

[+] Glance image-download (should be quite slow)

real    0m20.095s
user    0m13.867s
sys     0m4.452s

[+] rbd export (should be quite fast)

Exporting image: 100% complete...done.

real    0m12.831s
user    0m4.727s
sys     0m6.255s

Indeed, "rbd export" is faster than "glance image-download", but the latter does not seem *that* slow. I'm not sure I've really managed to reproduce the issue. What is the size of the image you use?

I'm getting similar results when running the API calls manually using curl (~18.5 seconds), so if there is an issue, it is probably not with the client.
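For completeness, this is roughly how I timed the raw API call with curl; the token retrieval and the v1 URL below are illustrative and would need adjusting to the local auth setup and API version in use:

~~~~~~~~~~~~~
# grab a token (illustrative; any valid token for the image's tenant works)
TOKEN=$(keystone token-get | awk '/ id / {print $4}')

# time the image data download straight from the glance API, bypassing the client
time curl -s -H "X-Auth-Token: $TOKEN" -o /dev/null \
    http://<glance-api-host>:9292/v1/images/<image-id>
~~~~~~~~~~~~~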
A 4.5GB raw image is what we used for testing.

--weiguo
To add, we have also observed similar slowness with the "glance image-create" process.

--weiguo
Thanks for the additional information. I'm not a glance expert, but seeing the size of the image makes me think about a couple of things and also raises some questions. From scanning the bug and the blueprint, it looks to me like there is a significant difference between the two paths, in addition to the fact that a lot has changed in the code paths from OSP5 to OSP10.

* What was the result of trying the Glance caching mentioned back in #c3? I would think the initial load would suffer but subsequent attempts would be better. This seems to be the only known mitigation path in OSP5.

* Has this always been a problem in the OSP5 production environment? Knowing whether image sizes have grown or whether load could be affecting the transfer paths would be helpful.

* The numbers here may be basic, but I would believe that glance fetching the image, converting it, and then nova getting the image (conversion + 2 hops) doesn't have much of a chance of being as fast as the RBD export (1 hop).

Again, I'm definitely not an expert, but I have added some folks to the bug to comment.
Hi Paul,

Here are the responses to your questions.

* What was the result of trying the Glance caching mentioned back in #c3? I would think the initial load would suffer but subsequent attempts would be better. This seems to be the only known mitigation path in OSP5.

- We didn't test it, as it doesn't suit our use case very well. The glance cache helps improve instance creation in some scenarios, but it doesn't help when our users upload a new image or create a new image from a snapshot. The same behavior applies to "glance image-create" as well, and the upload scenario can't be improved by the glance image cache.

* Has this always been a problem in the OSP5 production environment? Knowing whether image sizes have grown or whether load could be affecting the transfer paths would be helpful.

- We only have an OSP5 environment and don't have an environment on any other OSP version.

* The numbers here may be basic, but I would believe that glance fetching the image, converting it, and then nova getting the image (conversion + 2 hops) doesn't have much of a chance of being as fast as the RBD export (1 hop). Again, I'm definitely not an expert, but I have added some folks to the bug to comment.

- Yes, we understand that glance download will be slower than rbd export, but not on the scale we reported (roughly a 10-minute vs. 1-minute download).
Thanks for the response; we are trying to gather as much data as we can to help the engineers reproduce the same level of difference.

* With respect to the OSP5 response, let me re-phrase the question: has this issue been a problem for you from day 1 of your deployment, or did it grow worse over time, or possibly only with larger image sizes?

* We are continuing to investigate the ratio, but it seems pretty clear that the difference will be at least twice as bad, and it's not clear to me what the cost of conversion is depending on the resource load and image type.

I just want to be clear that this may be a difficult issue on a release this old.

Thanks,
Paul
Hi Paul,

* With respect to the OSP5 response, let me re-phrase the question: has this issue been a problem for you from day 1 of your deployment, or did it grow worse over time, or possibly only with larger image sizes?

- We didn't really track this in the beginning. But we have other sites with the same setup (same hardware/network setup and OSP5 software version) that don't exhibit the same performance issue.

* We are continuing to investigate the ratio, but it seems pretty clear that the difference will be at least twice as bad, and it's not clear to me what the cost of conversion is depending on the resource load and image type. I just want to be clear this may be a difficult issue on a release this old.

- Yeah, we are not sure either. What concerns us is the fluctuation in performance, too. Based on the sosreport in the original Red Hat case, we don't see any high load on the glance controller node, and ceph was pretty stable throughout the process too. So we really want to see which part (either the glance code or the rbd driver) could possibly be causing the issue.
We held a lengthy customer call with the glance and cinder folks on this issue quite a while ago. The follow-up indicated the case was closed because the deployment was no longer going to be supported. Closing out this BZ accordingly.