Bug 1193509

Summary: [SCALE] Rhevm synchronization bottlenecks - vm template and copy disk slowness
Product: Red Hat Enterprise Virtualization Manager Reporter: Yuri Obshansky <yobshans>
Component: ovirt-engine Assignee: Liron Aravot <laravot>
Status: CLOSED ERRATA QA Contact: Eldad Marciano <emarcian>
Severity: high Docs Contact:
Priority: high    
Version: 3.5.0 CC: acanan, aefrat, amureini, iheim, laravot, lpeer, lsurette, pkliczew, rbalakri, Rhev-m-bugs, tspeetje, yeylon, ykaul, ylavi, yobshans
Target Milestone: ovirt-3.6.3 Keywords: TestOnly
Target Release: 3.6.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-03-09 20:57:49 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Scale RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description             Flags
engine.log file         none
vdsm.log file           none
thread dump             none
Test results report     none
3.6.3-0.1 engine log    none
3.6.3-0.1 vdsm log      none
3.6.3-0.1 vdsm 1 log    none

Description Yuri Obshansky 2015-02-17 13:47:15 UTC
Description of problem:
A problem was detected during a load test that simulates VM template creation and copy-disk actions performed by 10 concurrent threads.
The most time-consuming operations are:
POST Copy Template Disk to Storage - 1172119 ms ~ 19 min
POST Create VM Template - 311459 ms ~ 5 min
The following errors appeared in engine.log during test execution:
1.
2015-02-15 16:10:03,804 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (DefaultQuartzScheduler_Worker-10) [59349bb5] Command UploadStreamVDSCommand(HostName = HOST-1, HostId = e9987fb8-09b3-4e05-8402-a78153b0f7e7) execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException
2015-02-15 16:10:03,804 ERROR [org.ovirt.engine.core.bll.UploadStreamCommand] (DefaultQuartzScheduler_Worker-10) [59349bb5] Command org.ovirt.engine.core.bll.UploadStreamCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022)
2.
2015-02-15 17:15:07,206 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVdsCommand] (DefaultQuartzScheduler_Worker-66) [595c41fa] Failed in FullListVds method
2015-02-15 17:15:07,206 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVdsCommand] (DefaultQuartzScheduler_Worker-66) [595c41fa] Command FullListVdsCommand(HostName = HOST-1, HostId = e9987fb8-09b3-4e05-8402-a78153b0f7e7, vds=Host[HOST-1,e9987fb8-09b3-4e05-8402-a78153b0f7e7], vmIds=[2a5ff114-fd21-4032-b373-ac71d278e0c5, be9318a7-8bb1-4445-9598-c97a90bad177, bdad524d-fb93-45a3-af0a-f8465884c8f5, 47a66f6f-2673-456a-b2b4-d0879d1bec2e]) execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to FullListVds, error = dictionary changed size during iteration, code = -32603  
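
The second error shows a collection being structurally modified while VDSM iterates over it; the Python RuntimeError text surfaces through the JSON-RPC layer as internal error code -32603. An illustrative Java analogue of that bug class (illustration only; the failing code is VDSM's Python):

import java.util.HashMap;
import java.util.Map;

// Illustration of the bug class behind error 2: structurally modifying a
// map while iterating over it. Java fails fast with
// ConcurrentModificationException; Python raises the equivalent
// "dictionary changed size during iteration".
public class ConcurrentMutationSketch {
    public static void main(String[] args) {
        Map<String, String> vmStates = new HashMap<>();
        vmStates.put("vm1", "Up");
        vmStates.put("vm2", "Down");
        for (String vmId : vmStates.keySet()) {
            vmStates.put("vm3", "Up"); // structural change mid-iteration -> CME
        }
    }
}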

Version-Release number of selected component (if applicable):
RHEV-M version - 3.5.0-0.28

How reproducible:
The load scripts simulated the following transactions (a minimal sketch of the request-then-poll pattern follows the list):
    POST Create VM
    GET VM Status (till VM is Down)
    POST VM Start
    GET VM Status (till VM is UP)
    SSH Command # date
    POST VM Stop
    GET VM Status (till VM is Down)
    POST Create VM Template
    GET Template Status (till Template is OK)
    GET Template Disk UID
    GET Storage UID
    POST Copy Template Disk to Storage
    GET Template Disk Status (till it is OK)
    DELETE Template
    DELETE VM
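
For reference, a minimal sketch of the request-then-poll pattern each transaction pair above follows. The base URL, entity path, status matching, and poll interval are hypothetical placeholders rather than the actual scripts (those are in the attached test results report):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Minimal sketch of the "request, then poll status" pattern used by the
// load scripts. All endpoint details below are placeholders.
public class PollingSketch {
    static final String BASE = "https://rhevm.example.com/api"; // placeholder

    static boolean statusIs(String path, String expected) throws IOException {
        HttpURLConnection c = (HttpURLConnection) new URL(BASE + path).openConnection();
        c.setRequestProperty("Accept", "application/xml");
        StringBuilder body = new StringBuilder();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(c.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) body.append(line);
        }
        return body.indexOf("<state>" + expected + "</state>") >= 0;
    }

    public static void main(String[] args) throws Exception {
        // e.g. the "GET VM Status (till VM is Down)" step after "POST Create VM"
        while (!statusIs("/vms/VM_ID", "down")) { // VM_ID is a placeholder
            Thread.sleep(1000);                   // fixed 1 s poll interval
        }
    }
}

Each "Full Flow" measurement below presumably spans the initial request plus this polling loop, which is why it dwarfs the bare REST request time.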

Duration: 2 hours

Steps to Reproduce:
See the transaction list above.

Actual results:

Transaction 	         Count 	Average (ms) 	90% Line (ms) 	Min (ms) 	Max (ms)
VM Create REST API Request 	60 	26536 	53406 	942 	63507
VM Status REST API Down 	2123 	22 	34 	15 	219
VM Create Full Flow 	60 	35485 	60222 	7370 	64545
VM Start REST API Request 	60 	3977 	9115 	311 	14090
VM Status REST API Up 	3437 	22 	32 	15 	174
VM Start Full Flow 	53 	62711 	68830 	43672 	73125
SSH Command date 	53 	291 	749 	47 	852
Stop VM REST API Request 	53 	3264 	6634 	1103 	8963
Stop VM Full Flow 	53 	40 	55 	25 	88
Template Create REST API Request 	53 	13846 	27971 	3550 	64627
Template Status REST API 	11491 	13 	15 	8 	569
Template Create Full Flow 	50 	243811 	311459 	172729 	344233
Template Get Disk ID REST API 	50 	31 	31 	16 	188
Storage Get ID 	50 	32 	49 	18 	64
Template Copy Disk REST API Request 	50 	40097 	61322 	483 	69154
Disk Status OK REST API 	51300 	11 	13 	7 	699
Template Copy Disk Full Flow 	50 	1106749 	1172119 	954465 	1190127
Delete Template REST API 	50 	53269 	78672 	5638 	113898
Delete VM REST API 	50 	29190 	44676 	2945 	56004

Expected results:


Additional info:
Attached:
- Detailed test results report
- engine.log
- vdsm.log
- thread dump

Piotr (pkliczew) investigated the thread dump and concluded:
I took a look at the thread dump and can see two places in the code with a synchronization blocking issue.
It is not as bad as in the dump from the customer.
- This is exactly the same issue as we saw in the customer env. Here we have only 2 threads being blocked:
  1. org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand.executeVDSCommand() @bci=11, line=153 (Compiled frame)
  2. org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData._updatingTimer_Elapsed() @bci=6, line=167 (Compiled frame)

- This one I haven't seen before, and it is synchronized per host. Here we have only 2 threads being blocked:
  1. org.ovirt.engine.core.vdsbroker.VdsIdVDSCommandBase.executeVDSCommand() @bci=16, line=47 (Compiled frame)
  2. org.ovirt.engine.core.vdsbroker.VdsIdVDSCommandBase.executeVDSCommand() @bci=16, line=47 (Compiled frame)
The first one blocked around 20 threads in the customer environment.
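
To make the blocking pattern in those frames concrete, a simplified model (an assumed structure for illustration, not the engine's actual IrsBrokerCommand/IrsProxyData code): command execution and the periodic pool-update timer contend on a single monitor, so one slow storage call stalls every thread queued behind it.

// Illustrative only -- a simplified model of the contention seen in the
// thread dump, not actual engine code. One shared monitor serializes every
// storage command together with the periodic update timer.
class SharedMonitorSketch {
    private final Object monitor = new Object();

    void executeVDSCommand(Runnable storageCall) {
        synchronized (monitor) { // every command queues here (frame 1 above)
            storageCall.run();   // one slow call blocks all waiting threads
        }
    }

    void updatingTimerElapsed(Runnable poolRefresh) {
        synchronized (monitor) { // the timer competes for the same lock (frame 2)
            poolRefresh.run();
        }
    }
}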

Comment 1 Yuri Obshansky 2015-02-17 13:48:20 UTC
Created attachment 992685 [details]
engine.log file

Comment 2 Yuri Obshansky 2015-02-17 13:49:14 UTC
Created attachment 992686 [details]
vdsm.log file

Comment 3 Yuri Obshansky 2015-02-17 13:49:39 UTC
Created attachment 992687 [details]
thread dump

Comment 4 Yuri Obshansky 2015-02-17 13:52:12 UTC
Created attachment 992688 [details]
Test results report

Comment 5 Piotr Kliczewski 2015-02-17 14:09:06 UTC
I took a look at the thread dump and can see two places in the code with a
synchronization blocking issue.
 - This is exactly the same issue as we saw in the customer env. Here we have only 2 threads being blocked:

org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand.executeVDSCommand()     @bci=11, line=153 (Compiled frame)

org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData._updatingTimer_Elapsed()    @bci=6, line=167 (Compiled frame)

- This one I haven't seen before, and it is synchronized per host. Here we have only 2 threads being blocked:

org.ovirt.engine.core.vdsbroker.VdsIdVDSCommandBase.executeVDSCommand()     @bci=16, line=47 (Compiled frame)

org.ovirt.engine.core.vdsbroker.VdsIdVDSCommandBase.executeVDSCommand()     @bci=16, line=47 (Compiled frame)

Comment 6 Allon Mureinik 2015-02-18 05:39:42 UTC
We need to check whether this synchronization is actually needed.

Anyway, it is worth revisiting during SPM removal, as the flows will probably change.
Liron, I'm assigning this to you to consider during your current refactoring.

Comment 7 Liron Aravot 2015-03-09 09:56:20 UTC
Let's split this into two different cases; the reproduction should be done again using the same transport layer.
1. The customer case: from what I've seen, they use the HTTP client.
2. Yuri's case, which uses JSON-RPC.


1. The customer case:
In the last log pasted on the case, I can see that a task is created successfully; the engine then just polls the tasks. Removing the synchronization (at least in IrsBroker) will add more load on the HTTP client. How did we come to the conclusion that the synchronization is the issue?
In the case I can see that the lastCheck values keep getting higher, which probably means there is heavy load, which might explain the timeouts and problems accessing the storage.

2. Yuri's logs:
In the logs that Yuri provided I can see errors in the transport layer (see the example below), so removing the synchronization will just add more load. What causes these exceptions?

5-143b-42fd-985e-7e2b7cffdd18, imageId = 7a9cb3cc-d4c6-4b2e-8f3e-5871d84bb164, dstImageGroupId = 92c1276d-c19a-41eb-a8e3-66a4e0f53259, vmId = 3da8d6ef-2ad2-49a1-8d3b-fc30495d653e, dstImageId = bafc90d9-6199-41f6-84bd-7aa62219bdf1, imageDescription = Active VM, dstStorageDomainId = ee15b9b9-b98c-4da2-9d04-17b7947e2b36, copyVolumeType = SharedVol, volumeFormat = RAW, preallocate = Sparse, postZero = false, force = false), exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded, log id: 8126f9d: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkExce


The thread dump that Piotr analyzed does show locks, but they don't seem to be the cause of the problem. In 3.6 those flows are going to change anyway for data centers >= 3.6, so I won't touch them unless it's mandatory.

Comment 8 Piotr Kliczewski 2015-03-09 10:13:39 UTC
(In reply to Liron Aravot from comment #7)
> Let's split this into two different cases; the reproduction should be done
> again using the same transport layer.
> 1. The customer case: from what I've seen, they use the HTTP client.
> 2. Yuri's case, which uses JSON-RPC.
> 
> 
> 1. The customer case:
> In the last log pasted on the case, I can see that a task is created
> successfully; the engine then just polls the tasks. Removing the
> synchronization (at least in IrsBroker) will add more load on the HTTP
> client. How did we come to the conclusion that the synchronization is the
> issue?

We saw it in the thread dump from the customer env. There are plenty of tasks that time out before being sent to VDSM.

> In the case I can see that the lastCheck values keep getting higher, which
> probably means there is heavy load, which might explain the timeouts and
> problems accessing the storage.

In my opinion, using synchronization as a throttling mechanism is not the best idea, and it will hit us hard in the future. The load should be handled on the VDSM side so as not to overload the storage.
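
A hedged sketch of the alternative being pointed at here: explicit, tunable throttling instead of a synchronized block acting as an accidental throttle. The class name and permit count are hypothetical, not engine code:

import java.util.concurrent.Semaphore;

// Hypothetical sketch: cap concurrency explicitly rather than serializing
// everything behind one monitor. Independent calls proceed in parallel
// until the cap is reached.
class ThrottledStorageClient {
    private final Semaphore permits = new Semaphore(10); // illustrative cap

    void call(Runnable storageOp) throws InterruptedException {
        permits.acquire();       // wait only when the cap is exhausted
        try {
            storageOp.run();
        } finally {
            permits.release();   // always return the permit
        }
    }
}

With an explicit cap, the backpressure is visible and tunable instead of being an accident of lock granularity.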

> 
> 2. Yuri's logs:
> In the logs that Yuri provided I can see errors in the transport layer (see
> the example below), so removing the synchronization will just add more
> load. What causes these exceptions?
> 
> 5-143b-42fd-985e-7e2b7cffdd18, imageId = 7a9cb3cc-d4c6-4b2e-8f3e-5871d84bb164, dstImageGroupId = 92c1276d-c19a-41eb-a8e3-66a4e0f53259, vmId = 3da8d6ef-2ad2-49a1-8d3b-fc30495d653e, dstImageId = bafc90d9-6199-41f6-84bd-7aa62219bdf1, imageDescription = Active VM, dstStorageDomainId = ee15b9b9-b98c-4da2-9d04-17b7947e2b36, copyVolumeType = SharedVol, volumeFormat = RAW, preallocate = Sparse, postZero = false, force = false), exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded, log id: 8126f9d: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkExce
> 

The "Heartbeat exeeded" errors are caused by another issue, BZ 1198680.

> 
> The thread dump that Piotr analyzed does show locks, but they don't seem
> to be the cause of the problem. In 3.6 those flows are going to change
> anyway for data centers >= 3.6, so I won't touch them unless it's
> mandatory.

I can only speak to the data gathered during 3.6 testing, but I haven't seen such data, so I can't tell whether we need to fix it or not.

Comment 11 Allon Mureinik 2015-08-10 16:02:39 UTC
Is this patch even still relevant in the current 3.6 codebase?

Comment 12 Liron Aravot 2015-08-20 12:31:25 UTC
In my opinion it's relevant, and we should look into it again for DCs without SPM.
Let's wait for replies from Yuri and Piotr, though.

Comment 13 Piotr Kliczewski 2015-08-20 22:24:56 UTC
I am not sure whether it is still relevant for 3.6. The changes were tested with 3.4 and 3.5. We would need to run the test one more time to understand whether this change improves performance.

Comment 14 Yuri Obshansky 2015-08-23 06:49:51 UTC
I agree with Piotr. We could retest that flow on 3.6. Let me know if it's still relevant and we'll put it in the testing task queue.

Comment 15 Liron Aravot 2015-08-23 11:01:17 UTC
Allon, what do you think?

Comment 16 Allon Mureinik 2015-08-23 14:12:12 UTC
Agreed.
Yuri - let's add these to the scale team's queue.

Thanks!

Comment 17 Yuri Obshansky 2015-08-23 14:20:45 UTC
Added to the scale team's queue.
It will be verified on one of the stable builds of 3.6.

Comment 20 Yuri Obshansky 2016-02-23 10:20:25 UTC
Bug verified on 
RHEVM: 3.6.3-0.1
KVM Version: 2.3.0 - 31.el7_2.7
LIBVIRT Version: libvirt-1.2.17-13.el7_2.3
VDSM Version: vdsm-4.17.19-0.el7ev

I performed the same load test as described in the bug.
No timeout errors occurred.

Response times are much better than before:
POST Copy Template Disk to Storage - 
194327 ms ~ 3 min (instead of 1172119 ms ~ 19 min)

POST Create VM Template - 
207961 ms ~ 3 min (instead of 311459 ms ~ 5 min)

The full response time table:
Label	                        #	Average (ms)	90% Line (ms)	Min (ms)	Max (ms)
VM Create REST API Request	60	4596	8894	1048	16068
VM Status REST API Down	        17377	43	44	25	302
VM Create Full Flow	        60	25940	43165	7372	57043
VM Start REST API Request	60	985	1474	359	2677
VM Status REST API Up	        3995	46	47	28	403
VM Start Full Flow	        60	69390	75079	60561	78089
root.64.22:22	        60	88	330	26	364
Stop VM REST API Request	60	1360	2087	722	13296
Stop VM Full Flow	        60	13834	14987	2849	15345
Template Create REST API Request	60	4739	6907	2181	9514
Template Status REST API 	9629	30	31	18	330
Template Create Full Flow	60	166251	207961	38319	226901
Template Get Disk ID REST API 	60	14	15	13	17
Storage Get ID REST API 	60	143	184	117	324
Template Copy Disk Request 	60	5533	10290	1458	16933
Disk Status OK REST API 	9298	15	16	9	341
Template Copy Disk Full Flow	60	158141	194327	45985	208160
Delete Template REST API 	60	8112	11626	2746	18545
Delete VM REST API 	        60	5006	7581	859	12684

Attached engine and vdsm log files.

It looks like the synchronization issue has been fixed and the bug can be closed.

However, during the test some errors were raised that look functional.
They need investigation:

2016-02-23 10:47:04,764 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-3) [74410bb2] Correlation ID: 1fe3327e, Job ID: 9086db57-102c-49b8-80d2-f70761e636d4, Call Stack: null, Custom Event ID: -1, Message: User admin@internal is copying disk testvm1_Disk1 to domain NFS-SD-23.

2016-02-23 10:50:04,578 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-42) [] Correlation ID: 1fe3327e, Job ID: 9086db57-102c-49b8-80d2-f70761e636d4, Call Stack: null, Custom Event ID: -1, Message: User admin@internal finished copying disk <UNKNOWN> to domain NFS-SD-23.
2016-02-23 10:50:04,579 ERROR [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-6-thread-42) [] [within thread]: endAction for action type MoveOrCopyDisk threw an exception.: java.lang.NullPointerException
        at org.ovirt.engine.core.bll.MoveOrCopyDiskCommand.isTemplate(MoveOrCopyDiskCommand.java:424) [bll.jar:]
        at org.ovirt.engine.core.bll.MoveOrCopyDiskCommand.incrementDbGenerationForRelatedEntities(MoveOrCopyDiskCommand.java:313) [bll.jar:]
        at org.ovirt.engine.core.bll.MoveOrCopyDiskCommand.endSuccessfully(MoveOrCopyDiskCommand.java:306) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.internalEndSuccessfully(CommandBase.java:651) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.endActionInTransactionScope(CommandBase.java:595) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1984) [bll.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:174) [utils.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:116) [utils.jar:]
        at org.ovirt.engine.core.bll.CommandBase.endAction(CommandBase.java:524) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.DecoratedCommand.endAction(DecoratedCommand.java:17) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CoCoAsyncTaskHelper.endAction(CoCoAsyncTaskHelper.java:359) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandCoordinatorImpl.endAction(CommandCoordinatorImpl.java:415) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandAsyncTask.endCommandAction(CommandAsyncTask.java:168) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandAsyncTask.access$000(CommandAsyncTask.java:30) [bll.jar:]
        at org.ovirt.engine.core.bll.tasks.CommandAsyncTask$1.run(CommandAsyncTask.java:111) [bll.jar:]
        at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:92) [utils.jar:]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_51]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_51]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_51]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_51]
        at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_51]
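
The NPE indicates endAction dereferenced an entity that the concurrent DELETE steps had already removed. A sketch of the defensive pattern one would expect (hypothetical types, not the engine's actual fix; see comment 24 below for the bug that tracks it):

// Illustrative only -- hypothetical types, not the engine's DiskImage API.
// The pattern: an async endAction must tolerate the entity having been
// deleted while the task was still running.
class NullGuardSketch {
    static class Image { Boolean template; }

    static boolean isTemplate(Image image) {
        // guard against the image having been removed by a concurrent delete
        return image != null && Boolean.TRUE.equals(image.template);
    }

    public static void main(String[] args) {
        System.out.println(isTemplate(null)); // prints false instead of NPE
    }
}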

Comment 21 Yuri Obshansky 2016-02-23 10:21:49 UTC
Created attachment 1129698 [details]
3.6.3-0.1 engine log

Comment 22 Yuri Obshansky 2016-02-23 10:22:21 UTC
Created attachment 1129699 [details]
3.6.3-0.1 vdsm log

Comment 23 Yuri Obshansky 2016-02-23 10:22:52 UTC
Created attachment 1129700 [details]
3.6.3-0.1 vdsm 1 log

Comment 24 Yuri Obshansky 2016-02-23 10:25:37 UTC
OK, I found the bug that reports the issue with
"MoveOrCopyDisk threw an exception.: java.lang.NullPointerException":
https://bugzilla.redhat.com/show_bug.cgi?id=1302715

So the current bug can be closed.

Comment 26 errata-xmlrpc 2016-03-09 20:57:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0376.html