Bug 1396057

Summary: Response time of the engine is significantly longer with RHVH hosts connected vs. regular RHEL hosts
Product: [oVirt] ovirt-engine
Reporter: Nelly Credi <ncredi>
Component: Backend.Core
Assignee: Roy Golan <rgolan>
Status: CLOSED WONTFIX
QA Contact: meital avital <mavital>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.0.5.5
CC: bugs, cshao, dguo, eberman, fdeutsch, mperina, ncredi, oourfali, pstehlik, ycui
Target Milestone: ---
Keywords: Performance
Target Release: ---
Flags: sbonazzo: ovirt-4.2-
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-07-04 07:53:04 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments:
- execution with rhevh (flags: none)
- execution with rhels (flags: none)
- debug logs (flags: none)
- Engine log on rhels only (flags: none)

Description Nelly Credi 2016-11-17 11:23:50 UTC
Description of problem:
When running tests we see a significant difference in execution time between environments with RHEL hosts only and environments that include RHVH hosts.
Note: the difference appears whenever the RHVH hosts are connected, even without running anything on them (the tests only use another RHEL host that is also in the environment).


Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. create a setup with RHEL hosts only
2. create a setup with at least one RHVH host
3. run some REST requests (we saw it in all requests, but the major differences were in refresh capabilities & setup networks) and compare the response times (a minimal timing sketch follows)
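
For reference, a minimal sketch of how a single request can be timed the way the test logs report "Request GET response time". This assumes the Python requests library; the engine URL, credentials, and the timed_get helper are placeholders, not the actual test framework:

import time

import requests

ENGINE = "https://engine.example.com"  # placeholder engine host
AUTH = ("admin@internal", "password")  # placeholder credentials

def timed_get(path, params=None):
    """Issue one GET against the engine REST API and return (status, seconds)."""
    start = time.monotonic()
    resp = requests.get(
        ENGINE + "/ovirt-engine/api/" + path,
        params=params,
        auth=AUTH,
        headers={"Accept": "application/xml"},
        verify=False,  # lab setups often use self-signed certificates
    )
    return resp.status_code, time.monotonic() - start

if __name__ == "__main__":
    # same query as the SEARCH example in the logs below
    status, elapsed = timed_get("clusters", {"search": "name=golden_env_mixed_1"})
    print("GET /clusters -> %s, response time: %.3f" % (status, elapsed))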

Actual results:
There are multiple examples in the attached logs; for instance:
SEARCH:
2016-11-03 17:45:01,062 - MainThread - clusters - DEBUG - SEARCH request content is --  url:https://jenkins-vm-12.scl.lab.tlv.redhat.com/ovirt-engine/api/clusters?search=name%3Dgolden_env_mixed_1
2016-11-03 17:45:01,492 - MainThread - core_api - DEBUG - Request GET response time: 0.320

2016-11-10 16:49:50,374 - MainThread - clusters - DEBUG - SEARCH request content is --  url:https://jenkins-vm-12.scl.lab.tlv.redhat.com/ovirt-engine/api/clusters?search=name%3Dgolden_env_mixed_1
2016-11-10 16:49:50,593 - MainThread - core_api - DEBUG - Request GET response time: 0.140

PUT:
2016-11-03 18:01:32,851 - MainThread - macpools - INFO - Using Correlation-Id: macpools_update_4a69df48-ad13-45aa
2016-11-03 18:01:33,207 - MainThread - core_api - DEBUG - Request PUT response time: 0.400

2016-11-10 17:03:04,004 - MainThread - macpools - INFO - Using Correlation-Id: macpools_update_1cd61eb2-5d3f-47ea
2016-11-10 17:03:04,293 - MainThread - core_api - DEBUG - Request PUT response time: 0.160

DELETE:
2016-11-03 18:01:36,571 - MainThread - datacenters - DEBUG - DELETE request content is --  url:/ovirt-engine/api/datacenters/fdb4d54a-453d-4f1e-bbcb-64c4647fd8cf
2016-11-03 18:01:37,146 - MainThread - core_api - DEBUG - Request DELETE response time: 0.640

2016-11-10 17:03:06,612 - MainThread - datacenters - DEBUG - DELETE request content is --  url:/ovirt-engine/api/datacenters/5ce11483-317b-435d-a93c-a7f6ce53ca50
2016-11-10 17:03:07,087 - MainThread - core_api - DEBUG - Request DELETE response time: 0.160

Expected results:
Response time should be the same, no matter which host type is connected to the engine.

Additional info:
With Roy's help we were able to see that when running "explain analyze select * from vds;",
the query time with RHVH hosts connected is much longer than with RHEL hosts only:
with RHEL hosts it's around 2ms (actually <2ms most of the time),
with RHVH hosts it's around 2.8ms and sometimes above 3ms.
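
For reference, a minimal sketch of that check, assuming psycopg2 and placeholder connection details for the engine database (adjust to your setup):

import psycopg2

# placeholder connection details for the engine database
conn = psycopg2.connect(dbname="engine", user="engine",
                        password="password", host="localhost")
with conn.cursor() as cur:
    cur.execute("EXPLAIN ANALYZE SELECT * FROM vds;")
    for (line,) in cur.fetchall():
        # the last lines of the plan include the total execution time in ms
        print(line)
conn.close()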

We mostly executed network tier1 tests.
At the beginning we saw that on the RHEL-only environment the tests took 23 minutes, while on an environment with RHVH they took 30 minutes.
We tried stopping the dwh service to see if it has an effect, but it appears that it doesn't - the last executions with and without it on an environment with RHVH hosts took 27-28 minutes.

Comment 1 Nelly Credi 2016-11-17 11:29:18 UTC
Created attachment 1221525 [details]
execution with rhevh

Comment 2 Nelly Credi 2016-11-17 11:30:07 UTC
Created attachment 1221526 [details]
execution with rhels

Comment 3 Oved Ourfali 2016-11-17 13:06:14 UTC
Can you try to narrow it down, and check whether specific parts took longer?
Any other differences between the two runs?

Comment 4 Nelly Credi 2016-11-17 15:25:00 UTC
We see it in every REST API call; the time until the engine answers is simply longer.

Comment 5 Roy Golan 2016-11-22 14:15:58 UTC
Please supply the engine.log with DEBUG enabled so we can spot where the time is spent.

Comment 6 Nelly Credi 2016-11-23 08:22:45 UTC
Created attachment 1223056 [details]
debug logs

Comment 7 Gil Klein 2016-11-24 14:43:43 UTC
From the log analysis, I counted 1204 occurrences where "The exclusive lock for key *** is released and lock is removed from map" is followed by a freeze of roughly 3 seconds before the next operation.

2016-11-16 03:06:11,838 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler10) [62e8b35a] The exclusive lock for key 'e8933e9f-665e-4186-80bf-f17e94c44e8cVDS_INIT' is released and lock is removed from map

2016-11-16 03:06:14,139 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler8) [36c519c3] Before acquiring lock 'EngineLock:{exclusiveLocks='[fec8135e-5910-4412-bc9a-bae12f1fa757=<VDS_INIT, >]', sharedLocks='null'}'
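
For reference, a minimal sketch of how such gaps can be counted, assuming the engine.log timestamp format shown above (the report_gaps helper and its 3-second threshold are illustrative, not the tool actually used):

import re
import sys
from datetime import datetime

TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")

def report_gaps(path, threshold=3.0):
    """Print every log line preceded by a pause of at least `threshold` seconds."""
    prev = None
    with open(path) as log:
        for line in log:
            match = TS.match(line)
            if not match:
                continue  # skip continuation lines without a timestamp
            ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            if prev is not None and (ts - prev).total_seconds() >= threshold:
                print("%.1fs gap before: %s"
                      % ((ts - prev).total_seconds(), line.rstrip()))
            prev = ts

if __name__ == "__main__":
    report_gaps(sys.argv[1])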

Comment 9 Fabian Deutsch 2016-12-01 10:11:27 UTC
Could it be related to gluster? In the logs I see:
2016-11-15 06:12:45,946 DEBUG [org.ovirt.engine.core.bll.gluster.GlusterSyncJob] (DefaultQuartzScheduler3) [4f6836ab] Refreshing Gluster Data [lightweight]
2016-11-15 06:12:45,946 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresJdbcTemplate] (DefaultQuartzScheduler3) [4f6836ab] Executing prepared SQL query
2016-11-15 06:12:45,946 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresJdbcTemplate] (DefaultQuartzScheduler3) [4f6836ab] Executing prepared SQL statement [select * from  getallfromcluster(?, ?)]
2016-11-15 06:12:46,548 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler8) [66ca7030] Before acquiring lock 'EngineLock:{exclusiveLocks='[e8933e9f-665e-4186-80bf-f17e94c44e8c=<VDS_INIT, >]', sharedLocks='null'}'
2016-11-15 06:12:46,548 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler8) [66ca7030] Success acquiring lock 'EngineLock:{exclusiveLocks='[e8933e9f-665e-4186-80bf-f17e94c44e8c=<VDS_INIT, >]', sharedLocks='null'}' succeeded
2016-11-15 06:12:46,548 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresJdbcTemplate] (DefaultQuartzScheduler8) [66ca7030] Executing prepared SQL query
2016-11-15 06:12:46,548 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresJdbcTemplate] (DefaultQuartzScheduler8) [66ca7030] Executing prepared SQL statement [select * from  getvdsbyvdsid(?, ?, ?)]
2016-11-15 06:12:46,554 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler8) [66ca7030] Before releasing a lock 'EngineLock:{exclusiveLocks='[e8933e9f-665e-4186-80bf-f17e94c44e8c=<VDS_INIT, >]', sharedLocks='null'}'
2016-11-15 06:12:46,554 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler8) [66ca7030] The exclusive lock for key 'e8933e9f-665e-4186-80bf-f17e94c44e8cVDS_INIT' is released and lock is removed from map
2016-11-15 06:12:49,555 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler5) [7535ab8a] Before acquiring lock 'EngineLock:{exclusiveLocks='[e8933e9f-665e-4186-80bf-f17e94c44e8c=<VDS_INIT, >]', sharedLocks='null'}'
2016-11-15 06:12:49,555 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler5) [7535ab8a] Success acquiring lock 'EngineLock:{exclusiveLocks='[e8933e9f-665e-4186-80bf-f17e94c44e8c=<VDS_INIT, >]', sharedLocks='null'}' succeeded


So the locking mentioned in comment 7 seems to be accompanied by the gluster refresh.

Nelly, could you please provide the logs of a good run? (i.e. the RHEL logs for the same run you used to gather the logs in comment 6)

Comment 10 Nelly Credi 2016-12-01 11:00:28 UTC
We have the same gluster version on the RHEL and RHVH hosts:
glusterfs-3.7.9-12.el7.x86_64

Comment 11 Nelly Credi 2016-12-01 11:02:06 UTC
Created attachment 1226750 [details]
Engine log on rhels only

Comment 12 Oved Ourfali 2017-07-04 07:53:04 UTC
Not sure whether this still reproduces, but we haven't identified any change that might cause this.

Closing as wontfix.