Description of problem:
The problem was detected during a load test that simulates VM template creation and disk copy actions performed by 10 concurrent threads.

The most time-consuming operations are:
POST Copy Template Disk to Storage - 1172119 ms ~ 19 min
POST Create VM Template - 311459 ms ~ 5 min

The following errors appeared in engine.log during test execution:

1. 2015-02-15 16:10:03,804 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (DefaultQuartzScheduler_Worker-10) [59349bb5] Command UploadStreamVDSCommand(HostName = HOST-1, HostId = e9987fb8-09b3-4e05-8402-a78153b0f7e7) execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException
2015-02-15 16:10:03,804 ERROR [org.ovirt.engine.core.bll.UploadStreamCommand] (DefaultQuartzScheduler_Worker-10) [59349bb5] Command org.ovirt.engine.core.bll.UploadStreamCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022)

2. 2015-02-15 17:15:07,206 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVdsCommand] (DefaultQuartzScheduler_Worker-66) [595c41fa] Failed in FullListVds method
2015-02-15 17:15:07,206 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVdsCommand] (DefaultQuartzScheduler_Worker-66) [595c41fa] Command FullListVdsCommand(HostName = HOST-1, HostId = e9987fb8-09b3-4e05-8402-a78153b0f7e7, vds=Host[HOST-1,e9987fb8-09b3-4e05-8402-a78153b0f7e7], vmIds=[2a5ff114-fd21-4032-b373-ac71d278e0c5, be9318a7-8bb1-4445-9598-c97a90bad177, bdad524d-fb93-45a3-af0a-f8465884c8f5, 47a66f6f-2673-456a-b2b4-d0879d1bec2e]) execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to FullListVds, error = dictionary changed size during iteration, code = -32603

Version-Release number of selected component (if applicable):
RHEV-M version - 3.5.0-0.28

How reproducible:
The scripts simulated the following transactions (see the polling sketch at the end of this description):
POST Create VM
GET VM Status (till VM is Down)
POST VM Start
GET VM Status (till VM is UP)
SSH Command # date
POST VM Stop
GET VM Status (till VM is Down)
POST Create VM Template
GET Template Status (till Template is OK)
GET Template Disk UID
GET Storage UID
POST Copy Template Disk to Storage
GET Template Disk Status (till it is OK)
DELETE Template
DELETE VM

Duration: 2 hours

Steps to Reproduce:
1.
2.
3.
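For reference, a minimal Java sketch of the status-polling pattern behind the "GET ... Status (till ...)" steps above. This is not the actual load-test script; the base URL, credentials, entity id and the crude XML match are assumptions made for illustration only.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Base64;

public class StatusPoller {
    // Hypothetical values; the real test has its own host, credentials and entity ids.
    private static final String BASE = "https://rhevm.example.com/api";
    private static final String AUTH = "Basic " + Base64.getEncoder()
            .encodeToString("admin@internal:password".getBytes());

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10))
                .build();
        // e.g. "GET VM Status (till VM is Down)" after POST Create VM
        waitForState(client, "/vms/2a5ff114-fd21-4032-b373-ac71d278e0c5", "down");
    }

    // Poll GET <BASE><path> until the XML body reports the expected state, or give up.
    static void waitForState(HttpClient client, String path, String state) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(BASE + path))
                .header("Authorization", AUTH)
                .header("Accept", "application/xml")
                .GET()
                .build();
        long deadline = System.currentTimeMillis() + Duration.ofMinutes(10).toMillis();
        while (System.currentTimeMillis() < deadline) {
            HttpResponse<String> resp = client.send(req, HttpResponse.BodyHandlers.ofString());
            // Crude string match on the status element; a real client would parse the XML.
            if (resp.body().contains("<state>" + state + "</state>")) {
                return;
            }
            Thread.sleep(2000);
        }
        throw new IllegalStateException("Timed out waiting for state '" + state + "' on " + path);
    }
}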
Actual results (times in ms):

Transaction                          Count   Average   90% Line   Min      Max
VM Create REST API Request           60      26536     53406      942      63507
VM Status REST API Down              2123    22        34         15       219
VM Create Full Flow                  60      35485     60222      7370     64545
VM Start REST API Request            60      3977      9115       311      14090
VM Status REST API Up                3437    22        32         15       174
VM Start Full Flow                   53      62711     68830      43672    73125
SSH Command date                     53      291       749        47       852
Stop VM REST API Request             53      3264      6634       1103     8963
Stop VM Full Flow                    53      40        55         25       88
Template Create REST API Request     53      13846     27971      3550     64627
Template Status REST API             11491   13        15         8        569
Template Create Full Flow            50      243811    311459     172729   344233
Template Get Disk ID REST API        50      31        31         16       188
Storage Get ID                       50      32        49         18       64
Template Copy Disk REST API Request  50      40097     61322      483      69154
Disk Status OK REST API              51300   11        13         7        699
Template Copy Disk Full Flow         50      1106749   1172119    954465   1190127
Delete Template REST API             50      53269     78672      5638     113898
Delete VM REST API                   50      29190     44676      2945     56004

Expected results:

Additional info:
Attached:
- Detailed test results report
- engine.log
- vdsm.log
- thread dump

Piotr (pkliczew) investigated the thread dump and concluded:

I took a look at the thread dump and I can see two places in the code with a synchronization blocking issue. It is not as bad as in the dump from the customer.

- This is exactly the same issue as we saw in the customer env. Here we have only 2 threads being blocked:
1. org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand.executeVDSCommand() @bci=11, line=153 (Compiled frame)
2. org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData._updatingTimer_Elapsed() @bci=6, line=167 (Compiled frame)

- This one I haven't seen before and it is synchronized per host. Here we have only 2 threads being blocked:
1. org.ovirt.engine.core.vdsbroker.VdsIdVDSCommandBase.executeVDSCommand() @bci=16, line=47 (Compiled frame)
2. org.ovirt.engine.core.vdsbroker.VdsIdVDSCommandBase.executeVDSCommand() @bci=16, line=47 (Compiled frame)

The first one blocked around 20 threads in the customer environment.
Created attachment 992685 [details] engine.log file
Created attachment 992686 [details] vdsm.log file
Created attachment 992687 [details] thread dump
Created attachment 992688 [details] Test results report
I took a look at the thread dump and I can see two places in the code with a synchronization blocking issue.

- This is exactly the same issue as we saw in the customer env. Here we have only 2 threads being blocked:
org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand.executeVDSCommand() @bci=11, line=153 (Compiled frame)
org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData._updatingTimer_Elapsed() @bci=6, line=167 (Compiled frame)

- This one I haven't seen before and it is synchronized per host. Here we have only 2 threads being blocked:
org.ovirt.engine.core.vdsbroker.VdsIdVDSCommandBase.executeVDSCommand() @bci=16, line=47 (Compiled frame)
org.ovirt.engine.core.vdsbroker.VdsIdVDSCommandBase.executeVDSCommand() @bci=16, line=47 (Compiled frame)
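For context, a stripped-down sketch of the kind of per-host locking the frames above point at; class and field names here are simplified guesses, not the actual engine code. One monitor per host means every VDS command to that host queues behind whichever command currently holds it, so a single slow storage or network call leaves the other callers parked in BLOCKED state, which is what the thread dump shows.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only - one monitor object per host, in the spirit of the
// VdsIdVDSCommandBase frames above.
public class PerHostCommandExecutor {
    private static final Map<String, Object> HOST_LOCKS = new ConcurrentHashMap<>();

    public void execute(String hostId, Runnable vdsCommand) {
        Object lock = HOST_LOCKS.computeIfAbsent(hostId, id -> new Object());
        // Every command targeting this host serializes here. If vdsCommand blocks on a
        // slow storage verb or a network timeout, all other callers for the same host
        // pile up waiting on the monitor.
        synchronized (lock) {
            vdsCommand.run();
        }
    }
}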
Need to check if this synchronization is needed. Anyway, worth revisiting during SPM removal, as the flows will probably change. Liron - I'm assigning to you to consider during your current refactoring.
Let's split this into two different cases - the reproduction should be done again using the same transport layer.
1. the customer case - from what I've seen they use the http client
2. Yuri's case, which uses json-rpc

1. The customer case -
In the last log pasted on the case, I can see that a task is created successfully - then the engine just polls the tasks. Removing the synchronization (at least on IrsBroker) will add more load on the http client. How did we come to the conclusion that the synchronization is the issue?
In the case I can see that the lastCheck values are getting higher, which means that there is probably heavy load, which might explain the timeouts and the problem accessing the storage.

2. Yuri's logs -
In the logs that Yuri provided I can see errors in the transport (see example below), so removing the synchronization will just add more load - what causes these exceptions?

5-143b-42fd-985e-7e2b7cffdd18, imageId = 7a9cb3cc-d4c6-4b2e-8f3e-5871d84bb164, dstImageGroupId = 92c1276d-c19a-41eb-a8e3-66a4e0f53259, vmId = 3da8d6ef-2ad2-49a1-8d3b-fc30495d653e, dstImageId = bafc90d9-6199-41f6-84bd-7aa62219bdf1, imageDescription = Active VM, dstStorageDomainId = ee15b9b9-b98c-4da2-9d04-17b7947e2b36, copyVolumeType = SharedVol, volumeFormat = RAW, preallocate = Sparse, postZero = false, force = false), exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded, log id: 8126f9d: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkExce

The thread dump that Piotr analyzed does show locks, but it doesn't seem like the cause of the problem - in 3.6 those flows are going to change anyway for data centers >= 3.6, so I won't touch them unless it's mandatory.
(In reply to Liron Aravot from comment #7)
> Let's split this into two different cases - the reproduction should be done
> again using the same transport layer.
> 1. the customer case - from what I've seen they use the http client
> 2. Yuri's case, which uses json-rpc
>
> 1. The customer case -
> In the last log pasted on the case, I can see that a task is created
> successfully - then the engine just polls the tasks. Removing the
> synchronization (at least on IrsBroker) will add more load on the http
> client. How did we come to the conclusion that the synchronization is the
> issue?

We saw it in the thread dump from the customer env. There are plenty of tasks which time out before being sent to vdsm.

> In the case I can see that the lastCheck values are getting higher, which
> means that there is probably heavy load, which might explain the timeouts
> and the problem accessing the storage.

In my opinion, using synchronization as a throttling mechanism is not the best idea and it will hit us hard in the future. The load should be handled on the vdsm side in order to not overload the storage (a sketch of an explicit per-host throttle follows this comment).

> 2. Yuri's logs -
> In the logs that Yuri provided I can see errors in the transport (see
> example below), so removing the synchronization will just add more load -
> what causes these exceptions?
>
> 5-143b-42fd-985e-7e2b7cffdd18, imageId = 7a9cb3cc-d4c6-4b2e-8f3e-5871d84bb164,
> dstImageGroupId = 92c1276d-c19a-41eb-a8e3-66a4e0f53259,
> vmId = 3da8d6ef-2ad2-49a1-8d3b-fc30495d653e,
> dstImageId = bafc90d9-6199-41f6-84bd-7aa62219bdf1, imageDescription = Active VM,
> dstStorageDomainId = ee15b9b9-b98c-4da2-9d04-17b7947e2b36,
> copyVolumeType = SharedVol, volumeFormat = RAW, preallocate = Sparse,
> postZero = false, force = false), exception: VDSNetworkException:
> VDSGenericException: VDSNetworkException: Heartbeat exeeded, log id: 8126f9d:
> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
> VDSGenericException: VDSNetworkExce

The "Heartbeat exeeded" errors are caused by a different issue, BZ 1198680.

> The thread dump that Piotr analyzed does show locks, but it doesn't seem
> like the cause of the problem - in 3.6 those flows are going to change
> anyway for data centers >= 3.6, so I won't touch them unless it's mandatory.

I can only speak to the data gathered while testing 3.6, and I haven't seen such data, so I can't tell whether we need to fix it or not.
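To illustrate the point above about synchronization being a poor throttling mechanism, here is a purely hypothetical per-host throttle built on a bounded semaphore (not engine code, and the limits are made up): callers either get a permit within a fixed window or fail fast with an explicit error, instead of sitting blocked on a monitor until the VDS call times out.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical alternative: cap concurrent commands per host and fail fast,
// instead of serializing everything behind one monitor.
public class PerHostThrottle {
    private static final int MAX_CONCURRENT_PER_HOST = 4;   // assumed limit
    private static final long ACQUIRE_TIMEOUT_SEC = 30;     // assumed bound
    private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();

    public void execute(String hostId, Runnable vdsCommand) throws InterruptedException {
        Semaphore s = permits.computeIfAbsent(hostId,
                id -> new Semaphore(MAX_CONCURRENT_PER_HOST, true));
        if (!s.tryAcquire(ACQUIRE_TIMEOUT_SEC, TimeUnit.SECONDS)) {
            // Surface the overload explicitly rather than letting callers sit blocked
            // until the transport layer reports a TimeoutException.
            throw new IllegalStateException("Host " + hostId + " is saturated, rejecting command");
        }
        try {
            vdsCommand.run();
        } finally {
            s.release();
        }
    }
}

Whether such a limit belongs in the engine or, as argued above, on the vdsm side is a separate question; the sketch only shows the fail-fast behaviour.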
Is this patch even still relevant in the current 3.6 codebase?
In my opinion it's relevant and we should look into it again for DCs without SPM. Let's wait for replies from Yuri and Piotr, though.
I am not sure whether it is still relevant for 3.6. The changes were tested with 3.4 and 3.5. We would need to run the test one more time to understand whether this change improves performance.
I agree with Piotr. We could retest that flow on 3.6. Let me know if it's still relevant and we'll put it in the testing task queue.
Allon, what do you think?
Agreed. Yuri - let's add these to the scale team's queue. Thanks!
Added to the scale team's queue. Will be verified on one of the stable builds of 3.6.
Bug verified on RHEV-M 3.6.3-0.1
KVM version: 2.3.0 - 31.el7_2.7
libvirt version: libvirt-1.2.17-13.el7_2.3
VDSM version: vdsm-4.17.19-0.el7ev

Performed the same load test as described in the bug. No timeout errors occurred, and response times are better than before:
POST Copy Template Disk to Storage - 194327 ms ~ 3 min (instead of 1172119 ms ~ 19 min)
POST Create VM Template - 207961 ms ~ 3 min (instead of 311459 ms ~ 5 min)

The full response time table (times in ms):

Label                             #       Average   90% Line   Min     Max
VM Create REST API Request        60      4596      8894       1048    16068
VM Status REST API Down           17377   43        44         25      302
VM Create Full Flow               60      25940     43165      7372    57043
VM Start REST API Request         60      985       1474       359     2677
VM Status REST API Up             3995    46        47         28      403
VM Start Full Flow                60      69390     75079      60561   78089
root.64.22:22                     60      88        330        26      364
Stop VM REST API Request          60      1360      2087       722     13296
Stop VM Full Flow                 60      13834     14987      2849    15345
Template Create REST API Request  60      4739      6907       2181    9514
Template Status REST API          9629    30        31         18      330
Template Create Full Flow         60      166251    207961     38319   226901
Template Get Disk ID REST API     60      14        15         13      17
Storage Get ID REST API           60      143       184        117     324
Template Copy Disk Request        60      5533      10290      1458    16933
Disk Status OK REST API           9298    15        16         9       341
Template Copy Full Disk Flow      60      158141    194327     45985   208160
Delete Template REST API          60      8112      11626      2746    18545
Delete VM REST API                60      5006      7581       859     12684

Attached engine and vdsm log files.

It looks like the synchronization issue has been fixed and the bug can be closed. However, during the test the following errors were raised, which look functional and need to be investigated:

2016-02-23 10:47:04,764 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-3) [74410bb2] Correlation ID: 1fe3327e, Job ID: 9086db57-102c-49b8-80d2-f70761e636d4, Call Stack: null, Custom Event ID: -1, Message: User admin@internal is copying disk testvm1_Disk1 to domain NFS-SD-23.
2016-02-23 10:50:04,578 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-42) [] Correlation ID: 1fe3327e, Job ID: 9086db57-102c-49b8-80d2-f70761e636d4, Call Stack: null, Custom Event ID: -1, Message: User admin@internal finished copying disk <UNKNOWN> to domain NFS-SD-23.
2016-02-23 10:50:04,579 ERROR [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-6-thread-42) [] [within thread]: endAction for action type MoveOrCopyDisk threw an exception.: java.lang.NullPointerException
	at org.ovirt.engine.core.bll.MoveOrCopyDiskCommand.isTemplate(MoveOrCopyDiskCommand.java:424) [bll.jar:]
	at org.ovirt.engine.core.bll.MoveOrCopyDiskCommand.incrementDbGenerationForRelatedEntities(MoveOrCopyDiskCommand.java:313) [bll.jar:]
	at org.ovirt.engine.core.bll.MoveOrCopyDiskCommand.endSuccessfully(MoveOrCopyDiskCommand.java:306) [bll.jar:]
	at org.ovirt.engine.core.bll.CommandBase.internalEndSuccessfully(CommandBase.java:651) [bll.jar:]
	at org.ovirt.engine.core.bll.CommandBase.endActionInTransactionScope(CommandBase.java:595) [bll.jar:]
	at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1984) [bll.jar:]
	at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:174) [utils.jar:]
	at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:116) [utils.jar:]
	at org.ovirt.engine.core.bll.CommandBase.endAction(CommandBase.java:524) [bll.jar:]
	at org.ovirt.engine.core.bll.tasks.DecoratedCommand.endAction(DecoratedCommand.java:17) [bll.jar:]
	at org.ovirt.engine.core.bll.tasks.CoCoAsyncTaskHelper.endAction(CoCoAsyncTaskHelper.java:359) [bll.jar:]
	at org.ovirt.engine.core.bll.tasks.CommandCoordinatorImpl.endAction(CommandCoordinatorImpl.java:415) [bll.jar:]
	at org.ovirt.engine.core.bll.tasks.CommandAsyncTask.endCommandAction(CommandAsyncTask.java:168) [bll.jar:]
	at org.ovirt.engine.core.bll.tasks.CommandAsyncTask.access$000(CommandAsyncTask.java:30) [bll.jar:]
	at org.ovirt.engine.core.bll.tasks.CommandAsyncTask$1.run(CommandAsyncTask.java:111) [bll.jar:]
	at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:92) [utils.jar:]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_51]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_51]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_51]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_51]
	at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_51]
Created attachment 1129698 [details] 3.6.3-0.1 engine log
Created attachment 1129699 [details] 3.6.3-0.1 vdsm log
Created attachment 1129700 [details] 3.6.3-0.1 vdsm 1 log
OK, I found the bug that reports the "MoveOrCopyDisk threw an exception.: java.lang.NullPointerException" issue: https://bugzilla.redhat.com/show_bug.cgi?id=1302715
So the current bug can be closed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-0376.html