Bug 1193509
Summary: [SCALE] Rhevm synchronization bottlenecks - vm template and copy disk slowness

| Field | Value | Field | Value |
|---|---|---|---|
| Product | Red Hat Enterprise Virtualization Manager | Reporter | Yuri Obshansky <yobshans> |
| Component | ovirt-engine | Assignee | Liron Aravot <laravot> |
| Status | CLOSED ERRATA | QA Contact | Eldad Marciano <emarcian> |
| Severity | high | Docs Contact | |
| Priority | high | | |
| Version | 3.5.0 | CC | acanan, aefrat, amureini, iheim, laravot, lpeer, lsurette, pkliczew, rbalakri, Rhev-m-bugs, tspeetje, yeylon, ykaul, ylavi, yobshans |
| Target Milestone | ovirt-3.6.3 | Keywords | TestOnly |
| Target Release | 3.6.0 | | |
| Hardware | x86_64 | | |
| OS | Linux | | |
| Whiteboard | | | |
| Fixed In Version | | Doc Type | Bug Fix |
| Doc Text | | Story Points | --- |
| Clone Of | | Environment | |
| Last Closed | 2016-03-09 20:57:49 UTC | Type | Bug |
| Regression | --- | Mount Type | --- |
| Documentation | --- | CRM | |
| Verified Versions | | Category | --- |
| oVirt Team | Scale | RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- | Target Upstream Version | |
| Embargoed | | | |
| Attachments | | | |
Description
Yuri Obshansky
2015-02-17 13:47:15 UTC
Created attachment 992685 [details]
engine.log file
Created attachment 992686 [details]
vdsm.log file
Created attachment 992687 [details]
thread dump
Created attachment 992688 [details]
Test results report
I took a look at the thread dump and I can see two places in the code with a synchronization blocking issue.

- This is exactly the same issue as we saw in the customer env. Here we have only 2 threads being blocked:
  org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand.executeVDSCommand() @bci=11, line=153 (Compiled frame)
  org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData._updatingTimer_Elapsed() @bci=6, line=167 (Compiled frame)
- This one I haven't seen before, and it is synchronized per host. Here we have only 2 threads being blocked:
  org.ovirt.engine.core.vdsbroker.VdsIdVDSCommandBase.executeVDSCommand() @bci=16, line=47 (Compiled frame)
  org.ovirt.engine.core.vdsbroker.VdsIdVDSCommandBase.executeVDSCommand() @bci=16, line=47 (Compiled frame)

We need to check whether this synchronization is necessary. Anyway, it is worth revisiting during SPM removal, as the flows will probably change. Liron - I'm assigning this to you to consider during your current refactoring.

Let's split this into two different cases - the reproduction should be done again using the same transport layer:
1. the customer case: from what I've seen, they use the HTTP client
2. Yuri's case, which uses JSON-RPC

1. The customer case - in the last log pasted on the case, I can see that a task is created successfully, and then the engine just polls the tasks. Removing the synchronization (at least on IrsBroker) will add more load on the HTTP client; how did we come to the conclusion that the synchronization is the issue? In the case I can see that the lastCheck values are getting higher, which means there is probably heavy load, which might explain the timeouts and the problems accessing the storage.

2. Yuri's logs - in the logs that Yuri provided I can see errors in the transport (see example below), so removing the synchronization will just add more load - what causes these exceptions?
5-143b-42fd-985e-7e2b7cffdd18, imageId = 7a9cb3cc-d4c6-4b2e-8f3e-5871d84bb164, dstImageGroupId = 92c1276d-c19a-41eb-a8e3-66a4e0f53259, vmId = 3da8d6ef-2ad2-49a1-8d3b-fc30495d653e, dstImageId = bafc90d9-6199-41f6-84bd-7aa62219bdf1, imageDescription = Active VM, dstStorageDomainId = ee15b9b9-b98c-4da2-9d04-17b7947e2b36, copyVolumeType = SharedVol, volumeFormat = RAW, preallocate = Sparse, postZero = false, force = false), exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded, log id: 8126f9d: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkExce

The thread dump that Piotr analyzed does show locks, but it doesn't seem like the cause of the problem - in 3.6 those flows are going to change anyway for data centers >= 3.6, so I won't touch them unless it's mandatory.

(In reply to Liron Aravot from comment #7)
> let's split it to two different cases - the reproduction should be done
> again using the same transport layer.
> 1. the customer case, from what i've seen they use the http client
> 2. yuri's case, which uses json rpc.
>
> 1. the customer case -
> In the last pasted log on the case, i can see that a task is created
> successfully - then the engine just polls the tasks - removing the
> synchronization (at least on IrsBroker) will add more load on the http
> client. how did we came to the conclusion that the synchronization is the
> issue?

We saw it in the thread dump from the customer env. There are plenty of tasks which time out before even being sent to VDSM.

> in the case i can see that the lastCheck values are getting higher, which
> means that probably there is heavy load which might explain timeout and
> problem accessing the storage.

In my opinion, using synchronization as a throttling mechanism is not the best idea, and it will hit us hard in the future. The load should be handled on the VDSM side in order not to overload the storage.
> 2. Yuri's logs -
> In the logs that yuri provided i can see errors in the transport (see
> example below), so removing the synchronization will just add more load -
> what causes these exceptions?
>
> 5-143b-42fd-985e-7e2b7cffdd18, imageId = 7a9cb3cc-d4c6-4b2e-8f3e-5871d84bb164,
> dstImageGroupId = 92c1276d-c19a-41eb-a8e3-66a4e0f53259,
> vmId = 3da8d6ef-2ad2-49a1-8d3b-fc30495d653e,
> dstImageId = bafc90d9-6199-41f6-84bd-7aa62219bdf1, imageDescription = Active VM,
> dstStorageDomainId = ee15b9b9-b98c-4da2-9d04-17b7947e2b36,
> copyVolumeType = SharedVol, volumeFormat = RAW, preallocate = Sparse,
> postZero = false, force = false), exception: VDSNetworkException:
> VDSGenericException: VDSNetworkException: Heartbeat exeeded, log id: 8126f9d:
> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
> VDSGenericException: VDSNetworkExce

The "Heartbeat exeeded" errors are caused by another issue, BZ 1198680.

> The thread dump that piotr analyzed does show locks, but it doesn't seem
> like the cause of the problem - in 3.6 those flows are gonna change anyway
> for data center >= 3.6, so i won't touch them unless it's mandatory.

I can only speak to the data gathered during 3.6 testing, but I haven't seen such data, so I can't tell whether we need to fix this or not.

Is this patch even still relevant in the current 3.6 codebase?

In my opinion it is still relevant, and we should look into it again in a DC without SPM. Let's wait for replies from Yuri and Piotr, though.

I am not sure whether it is still relevant for 3.6. The changes were tested with 3.4 and 3.5. We would need to run the test one more time to understand whether this change improves performance.

I agree with Piotr. We could retest that flow on 3.6. Let me know if it is still relevant and we'll put it in the testing task queue. Allon, what do you think?

Agreed. Yuri - let's add these to the scale team's queue. Thanks!

Added to the scale team's queue.
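The synchronization-versus-throttling point raised above can be illustrated with a minimal sketch. This is not the actual engine code (all names here are hypothetical): it contrasts a single monitor, which serializes every command regardless of how many worker threads exist, with an explicit `Semaphore`, which makes the concurrency bound visible and tunable.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class ThrottleSketch {
    // Style (a): implicit throttle via one monitor -- everything serializes,
    // the way a single per-pool lock in IrsBrokerCommand would.
    private static final Object MONITOR = new Object();

    static void runSerialized(Runnable cmd) {
        synchronized (MONITOR) {
            cmd.run();
        }
    }

    // Style (b): explicit throttle -- up to N commands in flight at once.
    private static final Semaphore PERMITS = new Semaphore(4, true);

    static void runThrottled(Runnable cmd) throws InterruptedException {
        PERMITS.acquire();
        try {
            cmd.run();
        } finally {
            PERMITS.release();
        }
    }

    // Runs `tasks` 50 ms commands on `threads` workers and returns elapsed ms.
    static long timeParallel(int tasks, int threads, boolean serialized) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        for (int i = 0; i < tasks; i++) {
            pool.submit(() -> {
                Runnable work = () -> {
                    try { Thread.sleep(50); } catch (InterruptedException ignored) { }
                };
                try {
                    if (serialized) runSerialized(work); else runThrottled(work);
                } catch (InterruptedException ignored) { }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        // Four 50 ms commands on four threads: ~200 ms serialized, ~50 ms throttled.
        System.out.println("serialized: " + timeParallel(4, 4, true) + " ms");
        System.out.println("throttled:  " + timeParallel(4, 4, false) + " ms");
    }
}
```

The semaphore variant is also closer to Piotr's point: a deliberate, tunable admission limit rather than an accidental one, with the real load control still belonging on the VDSM side.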
Will be verified on one of the stable builds of 3.6.

Bug verified on RHEVM 3.6.3-0.1
KVM Version: 2.3.0 - 31.el7_2.7
LIBVIRT Version: libvirt-1.2.17-13.el7_2.3
VDSM Version: vdsm-4.17.19-0.el7ev

Performed the same load test as described in the bug. No timeout errors occurred, and response times are better than in the earlier run:
POST Copy Template Disk to Storage - 194327 ms ~ 3 min (instead of 1172119 ms ~ 19 min)
POST Create VM Template - 207961 ms ~ 3 min (instead of 311459 ms ~ 5 min)

The full response time table (times in ms):

| Label | # | Average | 90%Line | Min | Max |
|---|---|---|---|---|---|
| VM Create REST API Request | 60 | 4596 | 8894 | 1048 | 16068 |
| VM Status REST API Down | 17377 | 43 | 44 | 25 | 302 |
| VM Create Full Flow | 60 | 25940 | 43165 | 7372 | 57043 |
| VM Start REST API Request | 60 | 985 | 1474 | 359 | 2677 |
| VM Status REST API Up | 3995 | 46 | 47 | 28 | 403 |
| VM Start Full Flow | 60 | 69390 | 75079 | 60561 | 78089 |
| root.64.22:22 | 60 | 88 | 330 | 26 | 364 |
| Stop VM REST API Request | 60 | 1360 | 2087 | 722 | 13296 |
| Stop VM Full Flow | 60 | 13834 | 14987 | 2849 | 15345 |
| Template Create REST API Request | 60 | 4739 | 6907 | 2181 | 9514 |
| Template Status REST API | 9629 | 30 | 31 | 18 | 330 |
| Template Create Full Flow | 60 | 166251 | 207961 | 38319 | 226901 |
| Template Get Disk ID REST API | 60 | 14 | 15 | 13 | 17 |
| Storage Get ID REST API | 60 | 143 | 184 | 117 | 324 |
| Template Copy Disk Request | 60 | 5533 | 10290 | 1458 | 16933 |
| Disk Status OK REST API | 9298 | 15 | 16 | 9 | 341 |
| Template Copy Full Disk Flow | 60 | 158141 | 194327 | 45985 | 208160 |
| Delete Template REST API | 60 | 8112 | 11626 | 2746 | 18545 |
| Delete VM REST API | 60 | 5006 | 7581 | 859 | 12684 |

Attached engine and vdsm log files. It looks like the synchronization issue has been fixed and the bug can be closed. However, during the test errors were raised which look functional and need investigation:

2016-02-23 10:47:04,764 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-3) [74410bb2] Correlation ID: 1fe3327e, Job ID: 9086db57-102c-49b8-80d2-f70761e636d4, Call Stack: null, Custom Event ID: -1, Message: User admin@internal is copying disk testvm1_Disk1 to domain NFS-SD-23.
2016-02-23 10:50:04,578 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-42) [] Correlation ID: 1fe3327e, Job ID: 9086db57-102c-49b8-80d2-f70761e636d4, Call Stack: null, Custom Event ID: -1, Message: User admin@internal finished copying disk <UNKNOWN> to domain NFS-SD-23.
2016-02-23 10:50:04,579 ERROR [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-6-thread-42) [] [within thread]: endAction for action type MoveOrCopyDisk threw an exception.: java.lang.NullPointerException
    at org.ovirt.engine.core.bll.MoveOrCopyDiskCommand.isTemplate(MoveOrCopyDiskCommand.java:424) [bll.jar:]
    at org.ovirt.engine.core.bll.MoveOrCopyDiskCommand.incrementDbGenerationForRelatedEntities(MoveOrCopyDiskCommand.java:313) [bll.jar:]
    at org.ovirt.engine.core.bll.MoveOrCopyDiskCommand.endSuccessfully(MoveOrCopyDiskCommand.java:306) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.internalEndSuccessfully(CommandBase.java:651) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.endActionInTransactionScope(CommandBase.java:595) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1984) [bll.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:174) [utils.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:116) [utils.jar:]
    at org.ovirt.engine.core.bll.CommandBase.endAction(CommandBase.java:524) [bll.jar:]
    at org.ovirt.engine.core.bll.tasks.DecoratedCommand.endAction(DecoratedCommand.java:17) [bll.jar:]
    at org.ovirt.engine.core.bll.tasks.CoCoAsyncTaskHelper.endAction(CoCoAsyncTaskHelper.java:359) [bll.jar:]
    at org.ovirt.engine.core.bll.tasks.CommandCoordinatorImpl.endAction(CommandCoordinatorImpl.java:415) [bll.jar:]
    at org.ovirt.engine.core.bll.tasks.CommandAsyncTask.endCommandAction(CommandAsyncTask.java:168) [bll.jar:]
    at org.ovirt.engine.core.bll.tasks.CommandAsyncTask.access$000(CommandAsyncTask.java:30) [bll.jar:]
    at org.ovirt.engine.core.bll.tasks.CommandAsyncTask$1.run(CommandAsyncTask.java:111) [bll.jar:]
    at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:92) [utils.jar:]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_51]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_51]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_51]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_51]
    at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_51]
Created attachment 1129698 [details]
3.6.3-0.1 engine log
Created attachment 1129699 [details]
3.6.3-0.1 vdsm log
Created attachment 1129700 [details]
3.6.3-0.1 vdsm 1 log
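The NullPointerException in the stack trace above is thrown from MoveOrCopyDiskCommand.isTemplate() when endAction runs after the disk entity is gone. A minimal sketch of that failure mode and a defensive null guard follows; this is hypothetical illustration code, not the actual engine code or its fix (the real NPE is tracked in BZ 1302715).

```java
// Hypothetical sketch of the endAction NPE and a defensive guard.
public class NullGuardSketch {
    // Stand-in for the engine's disk entity.
    static class DiskImage {
        private final boolean template;
        DiskImage(boolean template) { this.template = template; }
        boolean isTemplate() { return template; }
    }

    // Mimics endAction looking up a disk that was deleted mid-flow: the
    // lookup returns null, and the unguarded dereference throws NPE.
    static boolean isTemplateUnguarded(DiskImage image) {
        return image.isTemplate(); // NullPointerException when image == null
    }

    // Defensive variant: treat a vanished disk as "not a template".
    static boolean isTemplateGuarded(DiskImage image) {
        return image != null && image.isTemplate();
    }

    public static void main(String[] args) {
        System.out.println(isTemplateGuarded(null));                // false
        System.out.println(isTemplateGuarded(new DiskImage(true))); // true
    }
}
```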
OK, I found a bug reporting the issue with "MoveOrCopyDisk threw an exception.: java.lang.NullPointerException": https://bugzilla.redhat.com/show_bug.cgi?id=1302715

So the current bug can be closed.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-0376.html