Bug 1346727

Summary: [Blocked] REST API call response times and transaction elapsed times constantly increasing during the 8-hour stability test
Product: Red Hat OpenStack
Component: openstack-glance
Reporter: Yuri Obshansky <yobshans>
Assignee: Cyril Roelandt <cyril>
QA Contact: Yuri Obshansky <yobshans>
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: unspecified
Version: 9.0 (Mitaka)
Target Release: 11.0 (Ocata)
Keywords: ZStream
Hardware: x86_64
OS: Linux
CC: aavraham, berrange, dasmith, egafford, eglynn, fpercoco, kchamart, sbauza, scohen, sferdjao, sgordon, srevivo, vromanso, yobshans
Last Closed: 2016-12-21 14:51:06 UTC
Type: Bug
Attachments:
- Test result (unzip the attached archive to a local disk and open result.html)
- NOVA.GET.Images graph
- GLANCE.GET.Images graph
- Single thread NOVA.GET.Images graph
- Single thread GLANCE.GET.Images graph

Description Yuri Obshansky 2016-06-15 09:01:54 UTC
Created attachment 1168258 [details]
Test result. Unzip the attached archive to a local disk and open result.html

Description of problem:
NOVA REST API call response times and transaction elapsed times increase constantly during the 8-hour stability test.
The test loaded the environment with 10 concurrent threads simulating different operations on keystone, neutron, and nova via REST API calls.
The approximate load was 6.5 calls per second (which is not heavy at all).
Statistics for REST API calls:
09.NOVA.GET.Images - 1 sec up to 8 sec
13.NOVA.POST.Boot.Server - 1 sec up to 8 sec
22.NOVA.POST.Create.Image - 0.2 sec up to 5.7 sec
23.NOVA.GET.Image.Id - 0.06 sec up to 5 sec
24.NOVA.DELETE.Image - 0.08 sec up to 6 sec
11.NEUTRON.POST.Create.Network - 0.3 sec up to 4 sec
Statistics for Transactions:
13.NOVA.POST.Boot.Server.Trx - 5 sec up to 32 sec
14.NOVA.POST.Pause.Server.Trx - 0.7 sec up to 6 sec
15.NOVA.POST.Unpause.Server.Trx - 0.7 sec up to 6 sec
16.NOVA.POST.Suspend.Server.Trx - 0.7 sec up to 6 sec
18.NOVA.POST.Soft.Reboot.Server.Trx - 35 sec up to 72 sec
20.NOVA.POST.Stop.Server.Trx - 17 sec up to 34 sec
A detailed report is available online at
http://10.1.64.51:8080/job/RHOS-8-REST-API-Performance-test-Baseline/16/Test_result_HTML_Report/
or by unzipping the attached archive to a local disk and opening result.html.

Version-Release number of selected component (if applicable):
rhos-release 8 -p 2016-05-31.1


Comment 2 Yuri Obshansky 2016-06-15 09:46:07 UTC
The environment consists of 10 bare-metal machines (24x Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz, 64 GB RAM):
1 OSPd node, 3 controller nodes, and 6 compute nodes.
Instance creation used the m1.medium flavor (2 VCPUs, 4 GB RAM, 40 GB disk)
and the overcloud-full image (size 1023.2 MB).
Default installation and deployment.

Comment 3 Dan Smith 2016-06-16 18:00:50 UTC
> 09.NOVA.GET.Images - 1 sec up to 8 sec

This is a proxy to glance. Do the equivalent calls to glance directly take the same amount of time or less?

> 13.NOVA.POST.Boot.Server - 1 sec up to 8 sec

This involves calls to glance and neutron, so if those are being slow, this will be slow.

> 22.NOVA.POST.Create.Image - 0.2 sec up to 5.7 sec
> 23.NOVA.GET.Image.Id - 0.06 sec up to 5 sec
> 24.NOVA.DELETE.Image - 0.08 sec up to 6 sec

Also proxies to glance.

> 11.NEUTRON.POST.Create.Network - 0.3 sec up to 4 sec

This is a call directly to neutron and not related to nova or this bug, right?

> Statistics for Transactions:
> 13.NOVA.POST.Boot.Server.Trx - 5 sec up to 32 sec
> 14.NOVA.POST.Pause.Server.Trx - 0.7 sec up to 6 sec
> 15.NOVA.POST.Unpause.Server.Trx - 0.7 sec up to 6 sec
> 16.NOVA.POST.Suspend.Server.Trx - 0.7 sec up to 6 sec
> 18.NOVA.POST.Soft.Reboot.Server.Trx - 35 sec up to 72 sec
> 20.NOVA.POST.Stop.Server.Trx - 17 sec up to 34 sec

What do these mean? Is this (for example) "from boot to when the instance is ACTIVE" ?

Stop and Soft Reboot involve the guest, so I'm not sure they're worth discussing; there are other variables that affect them which aren't under our control. Pause is a very simple operation, so I would expect it to always be fast. I'm curious to know what "transaction" means here before commenting further.

Comment 4 Yuri Obshansky 2016-06-16 19:13:30 UTC
(In reply to Dan Smith from comment #3)
> > 09.NOVA.GET.Images - 1 sec up to 8 sec
> 
> This is a proxy to glance. Do the equivalent calls to glance directly take
> the same amount of time or less?
> 
I don't know; I didn't measure the glance REST API.
> > 13.NOVA.POST.Boot.Server - 1 sec up to 8 sec
> 
> This involves calls to glance and neutron, so if those are being slow, this
> will be slow.
We are talking about an increasing time. Why was it 1 sec and why did it grow to 8 sec?
> 
> > 22.NOVA.POST.Create.Image - 0.2 sec up to 5.7 sec
> > 23.NOVA.GET.Image.Id - 0.06 sec up to 5 sec
> > 24.NOVA.DELETE.Image - 0.08 sec up to 6 sec
> 
> Also proxies to glance.
> 
> > 11.NEUTRON.POST.Create.Network - 0.3 sec up to 4 sec
> 
> This is a call directly to neutron and not related to nova or this bug,
> right?
Yes, not related to nova. Just for information.

> > Statistics for Transactions:
> > 13.NOVA.POST.Boot.Server.Trx - 5 sec up to 32 sec
> > 14.NOVA.POST.Pause.Server.Trx - 0.7 sec up to 6 sec
> > 15.NOVA.POST.Unpause.Server.Trx - 0.7 sec up to 6 sec
> > 16.NOVA.POST.Suspend.Server.Trx - 0.7 sec up to 6 sec
> > 18.NOVA.POST.Soft.Reboot.Server.Trx - 35 sec up to 72 sec
> > 20.NOVA.POST.Stop.Server.Trx - 17 sec up to 34 sec
> 
> What do these mean? Is this (for example) "from boot to when the instance is
> ACTIVE" ?
> 
Right, that is a transaction.
Measurement starts from the REST API call until the instance's status is ACTIVE.

> Stop and Soft Reboot involve the guest, so I'm not sure they're worth
> discussing as there are other variables that affect it which aren't under
> our control. Pause is a very simple operation, so I would expect it to
> always be fast. Curious to know what this means in terms of "transaction" to
> comment further.
Again, "transaction" means the period from call some REST API till the state of instance changed to expected. For example from ACTIVE till PAUSED. If you have another definition I'll happy to use it in my reports.

Comment 5 Dan Smith 2016-06-16 19:20:49 UTC
> I don't know. I didn't measure glance REST API

Doing that is important to know whether the slowdown is related to glance or nova.

> We talks about an increasing time, Why it was 1 sec and grew up to 8 sec?

Answering that will be easier if we know if it's nova or glance that is slowing.

> Right. That is transaction. 
> Measuring started from REST API call till instance's status is ACTIVE.

Okay, what is your polling interval? Are you sure that the lag you're seeing is in the instance going to ACTIVE state and not just your HTTP request at one point hanging for a while?

Also, please show what config you're booting the servers with (i.e. boot options) as well as capture logs from all systems towards the end of the test with the slowdown. Or, point me at them if they're available.

> Again, "transaction" means the period from call some REST API till the state
> of instance changed to expected. For example from ACTIVE till PAUSED. If you
> have another definition I'll happy to use it in my reports.

Transaction usually refers to the database, but it's a fine term, I just wanted to make sure I was clear about what you meant.

Logs will help identify some delays in simple things like a pause operation. Please capture all logs with debug=True.

Comment 6 Yuri Obshansky 2016-06-20 08:42:08 UTC
Hi Dan,
I agree with you that it is important to understand whether the slowdown is related to glance or nova.
Regarding 09.NOVA.GET.Images - it is possible to add an additional REST API call using the Image service API v2 and correlate the two graphs after the test completes.
But 13.NOVA.POST.Boot.Server is a bit more complicated:
I cannot duplicate the instance creation using different APIs.
We need to discuss it offline (IRC).

The polling interval is 1000 msec, so the error could be +/- 1 sec.
In this case that is not critical, since we want to see the dynamics of the response time changes over the test period.

Boot instance config is:
{
  "server": {
    "name": "${srv_name}",
    "imageRef": "${img_id}",
    "flavorRef": "3",
    "networks": [
      {"uuid": "${net_id}"}
    ]
  }
}
where the flavor is m1.medium (2 VCPUs, 4 GB RAM, 40 GB disk)
and the image is overcloud-full (size 1023.2 MB).
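
For reference, a minimal sketch of how this boot request could be issued directly against the compute API; the endpoint URL, token, and UUIDs below are placeholders standing in for the ${...} JMeter variables above.

    import requests

    NOVA = "http://controller:8774/v2.1/<tenant_id>"   # placeholder compute endpoint
    HEADERS = {"X-Auth-Token": "<token>"}              # placeholder keystone token

    body = {
        "server": {
            "name": "perf-test-server",           # stands in for ${srv_name}
            "imageRef": "<image-uuid>",           # stands in for ${img_id} (overcloud-full)
            "flavorRef": "3",                     # m1.medium
            "networks": [{"uuid": "<net-uuid>"}]  # stands in for ${net_id}
        }
    }

    resp = requests.post(f"{NOVA}/servers", json=body, headers=HEADERS)
    resp.raise_for_status()
    print("created server", resp.json()["server"]["id"])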

Comment 7 Dan Smith 2016-06-20 19:57:20 UTC
Yuri,

Boot server involves many calls to other services, including glance and neutron. So, I know you can't fully reproduce the behavior with other calls, but knowing the timing of calls to those other services is critical to knowing where the delays are.

Still looking for logs with debug enabled and that information on the actual glance performance. Please set needinfo on me when you have that.

Comment 8 Dan Smith 2016-06-24 16:58:09 UTC
Started looking at the live machine a bit, beginning with image-list. I can confirm that of a roughly 6-8 second 'nova image-list' operation, about three seconds of the work is spent waiting on the glance api call here:

https://github.com/openstack/nova/blob/stable/liberty/nova/api/openstack/compute/images.py#L144-L145

About two seconds are spent communicating with keystone, and the rest seems to be just back and forth between the client probing the nova api version, getting its own token from keystone, etc.

Comment 9 Dan Smith 2016-06-24 17:12:52 UTC
Similarly, if I boot an instance and then do something like "nova show $instance", I see the following time spent:

- ~ 2.0 seconds authenticating with keystone
- ~ 2.4 seconds communicating with neutron
- ~ 3.6 seconds communicating with glance

A total of 9 seconds passed between the time in which nova-api received the request and when the response was returned. The above delays proxying to other services account for ~8 seconds of that 9.

Actual timing for reference is:

2016-06-24 17:03:25.581 Request received by nova-api
2016-06-24 17:03:26.209 Start request to keystone to validate token
2016-06-24 17:03:28.287 Response received from keystone
2016-06-24 17:03:28.638 Start request to neutron to validate port
2016-06-24 17:03:31.001 Response received from neutron
2016-06-24 17:03:31.089 Start request to glance to show image detail
2016-06-24 17:03:34.641 Response received from glance

Certainly these delays also impact boot server (to a much larger degree), as all the above involved services are contacted there as well, although it's harder to trace them through the process as that is split between multiple service nodes. I chose the "show" operation for tidiness.

Comment 10 Dan Smith 2016-06-24 17:32:57 UTC
Examining pause and unpause, I see the following:

2016-06-24 17:16:45.096 Request received by nova
2016-06-24 17:16:45.113 Start request to keystone to validate token
2016-06-24 17:16:46.553 Response received from keystone
2016-06-24 17:16:47.929 Server query returned from nova
2016-06-24 17:16:48.076 Start request to neutron to validate port
2016-06-24 17:16:50.441 Response received from neutron
2016-06-24 17:16:50.451 Server detail returned from nova
2016-06-24 17:16:50.490 Pause request received by nova
2016-06-24 17:16:51.038 Pause operation returns 202 to the client

First poll after the pause request ends shows the instance in PAUSED state. This tells me that the request was sent to the compute node and immediately acted upon. For example:

[heat-admin@overcloud-controller-2 ~]$ nova list --fields name,status; nova pause dantest; nova list --fields name,status
+--------------------------------------+---------+--------+
| ID                                   | Name    | Status |
+--------------------------------------+---------+--------+
| 05083c2c-bb54-41f2-81b1-28f2058493a4 | dantest | ACTIVE |
+--------------------------------------+---------+--------+
+--------------------------------------+---------+--------+
| ID                                   | Name    | Status |
+--------------------------------------+---------+--------+
| 05083c2c-bb54-41f2-81b1-28f2058493a4 | dantest | PAUSED |
+--------------------------------------+---------+--------+

The above operation took several seconds, but the actual pausing of the instance seems immediate to me. As best I can tell, the delays involved in running these commands come from the aforementioned delays communicating with services external to nova.

Perhaps you can compare these detailed delays to those experienced (or not) when the deployment is fresh to see if they are indeed much smaller. If so, we should ask the respective service owners (neutron, glance, and perhaps keystone) to examine the performance after the services have been running for a while.

When you do, it can be helpful to use the following syntax for the nova command, which allows you to target a specific API node to keep all the (API) logs in a single place:

nova --bypass-url 'http://172.17.0.14:8774/v2.1/c55ebc05b4994765944a8174683351a9' show dantest

Comment 11 Yuri Obshansky 2016-06-27 14:36:46 UTC
Dan, 
Thank you for your investigation. 
But I think we should come back to the concrete issue described in this bug:
REST API response time constantly increases during the 8-hour test run,
i.e. the time grows in an arithmetic progression.
I understand your point of view that nova is only a proxy,
so I executed another test (half as long as the original).
http://10.1.64.51:8080/job/RHOS-8-REST-API-Performance-test-Baseline/22/Test_result_HTML_Report/
Let's concentrate on the GetImage call.
The test executed each call 500 times:
NOVA.GET.Images  - {URL:port}/v2.1/${tenant_id}/images?name=${image_name}
GLANCE.GET.Images - {URL:port}/v2/images?name=${image_name}
and the times grow as follows:
NOVA.GET.Images - 693 ms -> 4259 ms
GLANCE.GET.Images - 231 ms -> 1235 ms
The GLANCE.GET.Images graph is flatter than the NOVA.GET.Images graph
(see attached images)
Note: the test ran with debug enabled, so the values are valid for this configuration only.
I expect all graphs to be flat and stable, as they are for other calls,
for example NOVA.GET.Flavors or 14.NOVA.POST.Pause.Server.
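
For illustration, a minimal sketch of timing the two calls side by side; the endpoints, token, and image name are placeholders for the values used in the JMeter test plan.

    import time
    import requests

    HEADERS = {"X-Auth-Token": "<token>"}                         # placeholder token
    NOVA_URL = "http://controller:8774/v2.1/<tenant_id>/images"   # nova image proxy
    GLANCE_URL = "http://controller:9292/v2/images"               # glance API directly

    def timed_get(url, image_name):
        """Return the elapsed time in ms of a filtered image-list call."""
        start = time.time()
        requests.get(url, params={"name": image_name}, headers=HEADERS).raise_for_status()
        return (time.time() - start) * 1000.0

    for i in range(500):
        nova_ms = timed_get(NOVA_URL, "overcloud-full")
        glance_ms = timed_get(GLANCE_URL, "overcloud-full")
        print(i, round(nova_ms), round(glance_ms), sep="\t")

Plotting the two columns over the run should give curves comparable to the attached graphs.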

Comment 12 Yuri Obshansky 2016-06-27 14:37:40 UTC
Created attachment 1172893 [details]
.NOVA.GET.Images graph

Comment 13 Yuri Obshansky 2016-06-27 14:38:14 UTC
Created attachment 1172894 [details]
GLANCE.GET.Images graph

Comment 14 Dan Smith 2016-06-27 14:48:26 UTC
The slopes of the glance and nova curves are pretty close (~5.3 vs ~6.0). Remember that every nova call has to hit keystone *and* glance.

Note that the flavor list test hits only nova and keystone and the curve is very flat.

When you run one of these tests, if you restart any of the services does it set the curve back to zero or does that only happen when you rebuild the system? I assume these tests are run linearly with a service restart in the middle or something, is that right?

If so, can you please restart glance (all of the glance services on all of the nodes) in the middle of a run and see if the nova curve is affected?

Comment 15 Yuri Obshansky 2016-06-27 15:38:30 UTC
(In reply to Dan Smith from comment #14)
> The slope of the glance and nova curves are pretty close (~5.3 vs ~6.0).
> Remember that every nova call has to hit keystone *and* glance.
> 
Yes, you are right.
We can ignore the call to keystone; it is very fast.
> Note that the flavor list test hits only nova and keystone and the curve is
> very flat.
> 
Right, but what about the others?
> When you run one of these tests, if you restart any of the services does it
> set the curve back to zero or does that only happen when you rebuild the
> system? I assume these tests are run linearly with a service restart in the
> middle or something, is that right?
> 
> If so, can you please restart glance (all of the glance services on all of
> the nodes) in the middle of a run and see if the nova curve is affected?
I didn't restart the glance services,
but each test run shows the same picture without any restart.
That is a good idea! I will do it tomorrow.
But I have another idea: to avoid the impact of concurrency (the test profile is 10 concurrent threads),
I'd like to run a single-thread test for 8 hours tonight.
Let's see what the result will be.

Comment 16 Dan Smith 2016-06-27 18:09:10 UTC
Yeah the linear test would be interesting.

From here, I think you should reclassify this as a glance bug, resolve that issue (which is clearly visible) and then see if we still have a nova problem afterwards.

Comment 17 Yuri Obshansky 2016-06-28 13:58:33 UTC
The single-thread (linear) test completed and no slowness was detected.
So the issue is in concurrency, i.e. response time increases while several concurrent threads perform REST API calls.
See the attached graphs:
SingleThread-NOVA.GET.Images.png
SingleThread-GLANCE.GET.Images.png
All graphs are flat and stable.
Details for the 2 REST API calls:
REST APIs             # Calls  Average   90% Line  Min      Max
09.NOVA.GET.Images    100      1485 ms   1629 ms   1339 ms  2268 ms
09.GLANCE.GET.Images  100      515 ms    572 ms    450 ms   653 ms

Comment 18 Yuri Obshansky 2016-06-28 13:59:06 UTC
Created attachment 1173458 [details]
Single thread NOVA.GET.Images graph

Comment 19 Yuri Obshansky 2016-06-28 13:59:35 UTC
Created attachment 1173459 [details]
Single thread GLANCE.GET.Images graph

Comment 20 Dan Smith 2016-06-28 14:10:12 UTC
Yep, that's interesting.

Any chance this is related to your test methodology? Something like you keeping lots of sockets open to the various servers because of a descriptor leak which causes things to get slower over time? We run tempest in parallel upstream and downstream and I'm not aware of any such increase in response time.

Please let me know after you have analysis on the parallel-test increasing delays from glance from someone familiar with that code. Once glance stops increasing over time, we can evaluate if there is additional work to be done in nova.

Comment 22 Yuri Obshansky 2016-06-28 14:18:55 UTC
(In reply to Dan Smith from comment #20)
> Yep, that's interesting.
> 
> Any chance this is related to your test methodology? Something like you
> keeping lots of sockets open to the various servers because of a descriptor
> leak which causes things to get slower over time? We run tempest in parallel
> upstream and downstream and I'm not aware of any such increase in response
> time.
> 
I don't think so.
JMeter opens an HTTP connection to the nova endpoint and sends the request.
Nothing special.
I can monitor the number of open file descriptors while the test is running.
Does anything else seem suspicious?
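
For illustration, a minimal sketch of sampling the open file descriptor count of a process (for example the JMeter JVM, or an API worker on a controller) during the run; the PID is a placeholder, and reading /proc/<pid>/fd requires running as the same user or root.

    import os
    import time

    def open_fd_count(pid):
        """Count entries in /proc/<pid>/fd (Linux only)."""
        return len(os.listdir(f"/proc/{pid}/fd"))

    pid = 12345  # placeholder: PID of the process under observation
    while True:
        print(time.strftime("%H:%M:%S"), open_fd_count(pid))
        time.sleep(60)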

> Please let me know after you have analysis on the parallel-test increasing
> delays from glance from someone familiar with that code. Once glance stops
> increasing over time, we can evaluate if there is additional work to be done
> in nova.
Yes, sure.
Could you recommend a Glance developer for me to contact?

Comment 23 Yuri Obshansky 2016-06-28 14:22:04 UTC
(In reply to Dan Smith from comment #20)
> Yep, that's interesting.
> 
> Any chance this is related to your test methodology? Something like you
> keeping lots of sockets open to the various servers because of a descriptor
> leak which causes things to get slower over time? We run tempest in parallel
> upstream and downstream and I'm not aware of any such increase in response
> time.
> 

What's the duration of a tempest run?
When I ran a short test (1-2 hours) I didn't see that issue either.


Comment 24 Dan Smith 2016-06-28 14:30:35 UTC
Just reassign this to glance or file a new one against glance and bugzilla will direct you to the first point of contact.

> What's the duration of tempest running?
> Because when I ran shot test (1-2 hours) I didn't see that issue also.

Depends on the system, but anywhere from 45 minutes to 2+ hours. Actual wall clock time is not likely to be really related. I'm sure it's number of requests or something like that :)

Comment 25 Sean Cohen 2016-07-07 12:55:51 UTC
(In reply to Dan Smith from comment #24)
> Just reassign this to glance or file a new one against glance and bugzilla
> will direct you to the first point of contact.
> 
> > What's the duration of tempest running?
> > Because when I ran shot test (1-2 hours) I didn't see that issue also.
> 
> Depends on the system, but anywhere from 45 minutes to 2+ hours. Actual wall
> clock time is not likely to be really related. I'm sure it's number of
> requests or something like that :)

Hi Yuri,
Can you verify which glance store was used here?
Sean

Comment 26 Yuri Obshansky 2016-07-12 12:12:01 UTC
Hi Sean, 
Unfortunately I can do nothing right now, since the lab is moving from Phoenix to Raleigh; I don't have an environment at all.
I hope the lab will be up within 2-3 weeks, and then I'll need to rerun the test.
But generally OpenStack was deployed with defaults, so glance should store images on the controller's local file system.
Let me know if you need additional information.

Yuri

Comment 27 Elise Gafford 2016-07-19 18:45:38 UTC
Question for Yuri:

Has this been tested on more recent versions of Glance/Nova (RHOS 9)? If so, do we see this behavior in 9 as well, or is it specific to 8?

Comment 28 Yuri Obshansky 2016-07-20 09:05:41 UTC
Hi Elise, 
I'm still not able to run tests since the lab is moving from Phoenix to Raleigh.
Yes, we plan to rerun all tests on RHOS 9 as soon as possible.
Waiting for servers...

Yuri

Comment 29 Sean Cohen 2016-12-21 14:51:06 UTC
Closing due to insufficient data; feel free to re-open once it is reproduced.
Sean