Description of problem:

I am handling a customer support case from Cisco with two findings:

1. Downloading a glance image with "rbd export" performs much faster than "glance image-download".
2. "glance image-download" from a local repository performs much faster than from the rbd backend.

Below is the customer's update:
~~~~~~~~~~~~~
See the following timing: "rbd export" started at the same time as "glance image-download", but "rbd export" finished in 12s while it took 1m15s for "glance image-download". During the "glance image-download", the progress meter shows frequent "pause and go", but "rbd export" does not.

If we stage the exact same image on the local file system of the glance-api node and disable the rbd data store option in glance-api.conf, "glance image-download" will fly through under 15s. So that proves there is no issue with the general node configuration.

root@csm-a-infra-001:~# date; time glance --os-image-url http://csx-a-glancectl-004:9292 image-download --file /dev/null --progress 68ac8018-f120-4dcd-8a7d-e1c66c4fe91a; date
Mon Sep 12 12:55:15 UTC 2016
[=============================>] 100%
real    1m15.741s
user    0m4.504s
sys     0m8.000s
Mon Sep 12 12:56:30 UTC 2016

[root@csx-a-glancectl-004 ~]# date; rm -f /tmp/wgs.qcow2; time rbd export csx-a-aio-glance-image-1/68ac8018-f120-4dcd-8a7d-e1c66c4fe91a /tmp/wgs.qcow2 --id csx-a-aio-glance-user; date
Mon Sep 12 12:55:16 UTC 2016
2016-09-12 12:55:17.065666 7fb7efc747c0 -1 asok(0x3659970) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/rbd-25748.asok': (2) No such file or directory
Exporting image: 100% complete...done.
real    0m12.252s
user    0m1.336s
sys     0m9.930s
Mon Sep 12 12:55:29 UTC 2016
~~~~~~~~~~~~~

Wondering if there is any configuration to improve the performance, or if this is expected and we need to explain it to the customer. The customer is aware of the following blueprint:
https://blueprints.launchpad.net/nova/+spec/direct-download-for-rbd

Version-Release number of selected component (if applicable):
RHEL OSP 5 on RHEL 7
python-glanceclient-0.13.1-1.el7ost.noarch
openstack-glance-2014.2.3-3.el7ost.noarch

How reproducible:
Always

Steps to Reproduce:
Scenario 1:
1. Configure glance with the rbd backend.
2. Compare "rbd export" and "glance image-download".

Scenario 2:
1. Time "glance image-download" with the rbd backend.
2. Reconfigure glance to use a local repository and download the same qcow2 image to compare.

Actual results:
"rbd export" is much faster than "glance image-download".
"glance image-download" from a local repository is much faster than from the rbd backend.

Expected results:
Is there any way to tune this for better performance?

Additional info:
Unfortunately, there's no configuration option to make this faster. The reason downloading directly from rbd is faster than downloading from glance-api is that in the latter case the image is effectively transferred twice: glance-api reads it from rbd and then sends it to the client over HTTP. It's not a direct stream. The way this is solved architecturally is by deploying the glance cache alongside the compute node. The work on direct image download from nova is still WIP (mostly in the discussion phase), and it depends on other work too.
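For reference, this is a minimal sketch of what enabling the Glance image cache typically looks like in glance-api.conf on a Juno-era (OSP5) deployment; treat the section names, path, and size below as illustrative and double-check them against the shipped glance-api.conf and glance-api-paste.ini:

~~~~~~~~~~~~~
[DEFAULT]
# local directory on the glance-api node used to cache image data
# after the first read from the rbd backend
image_cache_dir = /var/lib/glance/image-cache/
# upper bound, in bytes, used by the cache pruner (10 GB here)
image_cache_max_size = 10737418240

[paste_deploy]
# enable the caching/cache-management middleware in the API pipeline
flavor = keystone+cachemanagement
~~~~~~~~~~~~~

Keep in mind this only helps repeated downloads of the same image; the first download still streams through glance-api from rbd.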
Hello Flavio,

Thanks for the explanation. The customer is also interested in why glance image-download performs better from a local repository than from the rbd backend.

"If we stage the exact same image on the local file system of the glance-api node and disable the rbd data store option in glance-api.conf, "glance image-download" will fly through under 15s."
(In reply to James Biao from comment #4)
> Hello Flavio,
>
> Thanks for the explanation. The customer is also interested in why glance
> image-download performs better from a local repository than from the rbd
> backend.
>
> "If we stage the exact same image on the local file system of the glance-api
> node and disable the rbd data store option in glance-api.conf, "glance
> image-download" will fly through under 15s."

Hey James,

As I explained in my previous comment, reading over the network (from rbd, in this case) will make the download process slower than reading from the local filesystem. Is the network fast enough? Is there enough bandwidth? It's very likely they won't be able to match the performance results they are seeing with the locally staged copy. This has improved a bit in newer releases, btw.
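If it helps rule raw throughput in or out, here is a rough, hypothetical way to time the two read paths outside of glance, reusing the pool, image ID, and cephx user from the description (the local file path is just the copy produced by the earlier rbd export):

~~~~~~~~~~~~~
# raw rbd read, bypassing glance entirely (export to stdout and discard)
time rbd export csx-a-aio-glance-image-1/68ac8018-f120-4dcd-8a7d-e1c66c4fe91a - \
    --id csx-a-aio-glance-user > /dev/null

# local filesystem read on the glance-api node
# (drop the page cache first for a fair comparison: echo 3 > /proc/sys/vm/drop_caches)
time dd if=/tmp/wgs.qcow2 of=/dev/null bs=8M
~~~~~~~~~~~~~

If both of those are fast, the extra time is being spent in the glance-api streaming path rather than in the network or the storage itself.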
No, the network is not the bottleneck. If the glance-api rbd driver performed store-and-forward (staging the image to the local file system and then serving the cached local copy), it would be faster than the current streaming mechanism. It is not clear to me why the suboptimal streaming code path is considered acceptable.

(1) Store and forward (hypothetical process):
    ceph/rbd to the local file system: 15s
    glance-api sends the locally stored image file to the client: 15-17s

(2) Streaming (current mechanism, no caching to the local file system on the glance-api node):
    glance-api performs the rbd download and streams it to the client: 1m30s to 10m

It seems that the streaming code path for rbd in glance has very poor efficiency.

--weiguo
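P.S. The only store-side option we could find that looks related to how glance chunks its rbd I/O is the chunk size in glance-api.conf. Purely as a hypothetical tuning sketch: we have not verified whether this option affects the download (read) path on this release, the default varies between releases, and the section may be [DEFAULT] or [glance_store] depending on the packaged config:

~~~~~~~~~~~~~
[DEFAULT]
# size, in MB, of the chunks the rbd store works with;
# check the shipped default before changing it
rbd_store_chunk_size = 8
~~~~~~~~~~~~~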
So, I tried to reproduce this on an environment Avi set up for me. With a 1.5GB image, these are my results:

[+] Glance image-download (should be quite slow)

real    0m20.095s
user    0m13.867s
sys     0m4.452s

[+] rbd export (should be quite fast)

Exporting image: 100% complete...done.

real    0m12.831s
user    0m4.727s
sys     0m6.255s

Indeed, "rbd export" is faster than "glance image-download", but the latter does not seem *that* slow. I'm not sure I've really managed to reproduce the issue. What is the size of the image you use?

I'm getting similar results when running the API calls manually using curl (~18.5 seconds), so if there is an issue, it is probably not with the client.
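For completeness, this is roughly how I timed the raw API call with curl; the token retrieval and the v1 URL below are illustrative and would need adjusting to the local auth setup and API version in use:

~~~~~~~~~~~~~
# grab a token (illustrative; any valid token for the image's tenant works)
TOKEN=$(keystone token-get | awk '/ id / {print $4}')

# time the image data download straight from the glance API, bypassing the client
time curl -s -H "X-Auth-Token: $TOKEN" -o /dev/null \
    http://<glance-api-host>:9292/v1/images/<image-id>
~~~~~~~~~~~~~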
A 4.5GB raw image is what we used for testing.

--weiguo
To add, we have also observed similar slowness with the "glance image-create" process.

--weiguo
Thanks for the additional information. I'm not a glance expert, but seeing the size of the image makes me think about a couple of things and also raises some questions. From scanning the bug and the blueprint, it looks to me like there is a significant difference between the two paths, in addition to the fact that a lot has changed in the code paths from OSP5 to OSP10.

* What was the result of trying the Glance caching mentioned back in #c3? I would think the initial load would suffer but subsequent attempts would be better. This seems to be the only known mitigation path in OSP5.

* Has this always been a problem in the OSP5 production environment? Knowing whether image sizes have grown or whether load could be affecting the transfer paths would be helpful.

* The numbers here may be basic, but I would believe that glance fetching the image, converting it, and then nova getting the image (conversion + 2 hops) doesn't have much of a chance of being as fast as the RBD export (1 hop).

Again, I'm definitely not an expert, but I have added some folks to the bug to comment.
Hi Paul,

Here are the responses to your questions.

* What was the result of trying the Glance caching mentioned back in #c3? I would think the initial load would suffer but subsequent attempts would be better. This seems to be the only known mitigation path in OSP5.

- We didn't test it, as it doesn't suit our use case very well. The glance cache helps improve instance creation in some scenarios, but it doesn't help when our users upload a new image or create a new image from a snapshot. The same behavior applies to "glance image-create" as well, and the upload scenario can't be improved by the glance image cache.

* Has this always been a problem in the OSP5 production environment? Knowing whether image sizes have grown or whether load could be affecting the transfer paths would be helpful.

- We only have an OSP5 environment and don't have an environment on any other OSP version.

* The numbers here may be basic, but I would believe that glance fetching the image, converting it, and then nova getting the image (conversion + 2 hops) doesn't have much of a chance of being as fast as the RBD export (1 hop). Again, I'm definitely not an expert, but I have added some folks to the bug to comment.

- Yes, we understand that glance download will be slower than rbd export, but not on the scale we reported (roughly a 10-minute vs. 1-minute download).
Thanks for the response; we are trying to gather as much data as we can to help the engineers reproduce the same level of difference.

* With respect to the OSP5 response, let me re-phrase the question: has this issue been a problem for you from day 1 of your deployment, or did it grow worse over time, or possibly only with larger image sizes?

* We are continuing to investigate the ratio, but it seems pretty clear that the difference will be at least twice as bad, and it's not clear to me what the cost of conversion is depending on the resource load and image type.

I just want to be clear that this may be a difficult issue on a release this old.

Thanks,
Paul
Hi Paul,

* With respect to the OSP5 response, let me re-phrase the question: has this issue been a problem for you from day 1 of your deployment, or did it grow worse over time, or possibly only with larger image sizes?

- We didn't really track this in the beginning. But we have other sites with the same setup (same hardware/network setup and OSP5 software version) that don't exhibit the same performance issue.

* We are continuing to investigate the ratio, but it seems pretty clear that the difference will be at least twice as bad, and it's not clear to me what the cost of conversion is depending on the resource load and image type. I just want to be clear this may be a difficult issue on a release this old.

- Yeah, we are not sure either. What concerns us is the fluctuation in performance, too. Based on the sosreport in the original Red Hat case, we don't see any high load on the glance controller node, and ceph was pretty stable throughout the process too. So we really want to see which part (either the glance code or the rbd driver) could possibly be causing the issue.
We held a lengthy customer call with the glance and cinder folks on this issue quite a while ago. The follow-up indicated the case was closed because the deployment was no longer going to be supported. Closing out this BZ accordingly.