Bug 1102147
Summary: | [scale] ReconstructMasterDomainCommand issue with 200 VMs
---|---
Product: | Red Hat Enterprise Virtualization Manager
Reporter: | Yuri Obshansky <yobshans>
Component: | vdsm
Assignee: | Francesco Romani <fromani>
Status: | CLOSED NOTABUG
QA Contact: | Yuri Obshansky <yobshans>
Severity: | high
Priority: | unspecified
Version: | 3.4.0
CC: | aberezin, bazulay, cpelland, danken, dkuznets, dsulliva, emarcian, fromani, gklein, iheim, kabbott, laravot, lpeer, michal.skrivanek, nsoffer, oourfali, pkliczew, pstehlik, sherold, ybronhei, yeylon, yobshans
Target Milestone: | ---
Target Release: | 3.5.1
Hardware: | x86_64
OS: | Linux
Whiteboard: | virt
Doc Type: | Bug Fix
Story Points: | ---
: | 1110835 1167229 (view as bug list)
Last Closed: | 2014-12-16 10:57:05 UTC
Type: | Bug
Regression: | ---
Mount Type: | ---
Documentation: | ---
Category: | ---
oVirt Team: | ---
Cloudforms Team: | ---
Bug Depends On: | 1113106, 1113116
Bug Blocks: | 1110835, 1110848
Created attachment 900023 [details]
vdsm.log
+1, reproduced on 3.3 too.

Yuri:
- At what rate were the VMs created?
- In bulks? All together?
Danken - does it smell like M2C?

It looks similar to [1]. Can we get the size of the data sent as a response to GetAllVmStatsVDSCommand?

[1] http://rt.openssl.org/Ticket/Display.html?id=598

Barak - one by one, no bulks.
Piotr - how can I measure the size of the data?

The easiest way to get the response size is to log or print len(response). You can modify [1] and log the length just before the write in the do_POST method.

[1] /usr/lib64/python2.6/site-packages/vdsm/utils.py

It might be an M2Crypto issue, but I doubt that, as it happens every 4 minutes plus 5 or 10 seconds. It's a bit too periodical. It might be a matter of the Engine choking up the connection periodically - can you check if there's a spike in traffic during the time of the errors?

It looks like a Quartz job on the engine tries to get all VM stats every couple of minutes and, for some reason, never receives an answer. After some time it recovers and tries again. I think we need to understand why writing the response fails on the vdsm side.

There is no source on this host machine, so we can't modify it to see the size. I will try to recreate the bug by playing with the size of the response.

It's a matter of load: running "vdsClient getAllVmStats" on that host (which runs 200 VMs) takes 4 minutes to return. The exception in VDSM is one issue that must be solved, and it sounds like a regression. From the engine's perspective, 4 minutes must reach a timeout; we can't do much about that. Vdsm needs to respond faster.

PID   USER PR  NI  VIRT  RES  SHR S   %CPU %MEM  TIME+    COMMAND
23221 vdsm  0 -20 32.2g 900m   9m S 1066.6  0.2 110:59.49 vdsm
61074 root 20   0 7290m  31m 8112 S  114.9  0.0 12899:21  libvirtd
29402 root 20   0 17092 3356 1004 R   21.0  0.0 0:30.96   top
29582 qemu 20   0 1726m 1.2g 6364 S    8.3  0.3 439:08.29 qemu-kvm
...

Not sure it's supposed to work... have we ever checked 200*3 Python threads at the same time?

The response size is over 370 KB, so following the issue from the link, it seems that while this huge response is being sent, new messages are received and sent, so the write retry fails. There are similar issues [1][2] that people see with OpenSSL.

[1] http://stackoverflow.com/questions/20715408/solved-pyopenssl-ssl3-write-pendingbad-write-retry-return-self-sslobj-write
[2] http://stackoverflow.com/questions/14695247/in-php-openssl-error-messages-error-1409f07f-ssl-routines-ssl3-write-pendin

Created attachment 901431 [details]
VDSM response to get all vm stats
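Piotr's len(response) probe above, sketched out. The method name do_POST and the module path come from the traceback quoted in the description; the code around the logging call is a hypothetical stand-in, not the actual vdsm source:

```python
# Hypothetical sketch of the suggested probe in
# /usr/lib64/python2.6/site-packages/vdsm/utils.py: log len(response)
# just before the write in do_POST. Only the logging line is the point;
# _dispatch_request is an invented placeholder for vdsm's real dispatch.
import logging

def do_POST(self):
    response = self._dispatch_request()
    logging.info("XML-RPC response size: %d bytes", len(response))
    self.wfile.write(response)   # the write that later fails with
                                 # "SSLError: bad write retry"
```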
The discussed host contains 400 GB RAM and 32 CPU cores. What limitations do we have on such hosts? Do we allow running 400 VMs, each with 1 GB RAM? Don't we need a limitation based on CPUs also? Currently we're testing the maximum number of VMs this host can run before reaching such a slow response that the engine throws a timeout on adding.

Francesco, could getBalloonInfo be of help? I suppose with 200 VMs, eliminating the dom access may help at least with the responsiveness of the getstats call. Can we try VMs without balloon?

Worth a try... @Yaniv, 32 GB virtual size is just ill, any idea why?

We started to reproduce the issue and find the limit. We cleaned up the RHEVM, reinstalled vdsm on the host machine and started populating VMs from 0 until we got the exception. It happened on the 182nd VM.

2014-06-03 12:13:42,876 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler_Worker-7) [17f97fa8] Command GetAllVmStatsVDSCommand(HostName = HOST-REAL, HostId = ee38bc5a-0ad7-4034-80a6-a2af84eac5dc, vds=Host[HOST-REAL,ee38bc5a-0ad7-4034-80a6-a2af84eac5dc]) execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException

(In reply to Yaniv Bronhaim from comment #15)
> don't we need to have a limitation based on cpus also?

I don't know if such a limitation exists on the Engine side. In my opinion, it should not, as CPU over-commitment is one of the benefits of virtualization. If the guests do nothing, and Vdsm (and the rest of the management stack) is responsive, we should accept more VMs.

(In reply to Yaniv Bronhaim from comment #11)
> It's a matter of load: running "vdsClient getAllVmStats" on that host (which
> runs 200 VMs) takes 4 minutes to return. [...] Vdsm needs to respond faster.

For the response time, this patch may provide some relief: http://gerrit.ovirt.org/#/c/27696/5

I'm all in to make things faster, but we'll need some profile data for this, and that will in turn require VDSM >= 4.15 (yappi support was merged well after 4.14). Lastly, the amount of memory consumed is indeed worrisome. I'm thinking of adding optional support for memory profiling as well.

Yuri, I've tried Francesco's patch on my humble i7 quad-core Dell with 125 faqemu VMs and it got the vdsClient getAllVmStats response time from 6 minutes down to 1 second (!), can you try it?

Yes, sure. I'm preparing the environment for bug verification right now. Send me the patch and a guide on how to install it by e-mail. I think we should first reproduce the bug, and only after that apply the patch and try to reproduce again.

VDSM patch backported here: http://gerrit.ovirt.org/#/c/28686/1

I don't think the patch was meant to steal the bug completely. Also @Yuri, be careful when answering on a bug with multiple needinfos :)

(In reply to Dan Kenigsberg from comment #18)
> to have a limitation based on cpus also?

There's no such limitation. +1 on being able to absorb more.

(In reply to Yaniv Bronhaim from comment #12)
> have we ever checked 200*3 python's threads at same time?

Why *3? Also please see comment #16.

I don't know the reason for this huge virtual memory capacity, but it still looks like Francesco's patch has a huge improvement impact on that. Dima will share profiling output and usage status soon for more investigation.
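As a reference for the kind of profile data being discussed (yappi support merged after VDSM 4.14), here is a minimal, generic yappi usage sketch - plain library usage, not the actual vdsm integration; run_workload is a hypothetical placeholder:

```python
# Generic yappi profiling sketch (not the vdsm integration).
import yappi

yappi.start(builtins=False)       # profile application code in all threads
run_workload()                    # hypothetical: e.g. serve getAllVmStats
yappi.stop()

stats = yappi.get_func_stats()
stats.sort("ttot")                # total time, including subcalls
stats.print_all()                 # hottest functions first
yappi.get_thread_stats().print_all()  # per-thread CPU breakdown
```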
Do we want to do anything else except the balloon here? If not, I'd take it under virt and put Francesco in charge of backports...

The backport is already pushed but not verified [1]. Dima is working with Yuri to find the limitations and see if the patch helps in his environment. If we find out that Francesco's improvement is enough to fix the communication issue, I'll forward you the bug to continue.

[1] http://gerrit.ovirt.org/#/c/28686

Looks like the patch fixes the issue for this number of VMs on the same environment; it reduces the getAllVmStats response time to 30 seconds.

The applied patch has resolved our problem with 200 VMs. The host is stable, no exceptions. But: we unsuccessfully tried to create an additional 100 VMs (increasing the number of VMs up to 300). The engine (or vdsm) started to lose VMs - every minute a VM changed state from Up to Not Responding and vice versa. I found only this info message in engine.log:

2014-06-18 11:17:47,316 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-46) [a94764c] VM scale_108 c03cdb08-bc8c-4649-b034-617492f87b71 moved from Up --> NotResponding
2014-06-18 11:17:47,324 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-46) [a94764c] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM scale_108 is not responding.
............
2014-06-18 11:18:10,845 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-41) [2a9e7042] VM scale_108 c03cdb08-bc8c-4649-b034-617492f87b71 moved from NotResponding --> Up

Looks like it is a limit - 200 VMs. See the attached new log files.

Created attachment 909986 [details]
Engine log file after patch applied
Created attachment 909987 [details]
VDSM log after patch applied
Hi, the VDSM log is too tiny, there isn't much to see here (it covers ~18 seconds!). Is it possible to have more VDSM logs? We'd probably greatly benefit from profiling data, as Dan suggested in https://bugzilla.redhat.com/show_bug.cgi?id=1102147#c19

Yes, sure. Uploaded 5 VDSM log files.

Created attachment 909992 [details]
VDSM 1
Created attachment 909993 [details]
VDSM 2
Created attachment 909994 [details]
VDSM 3
Created attachment 909995 [details]
VDSM 4
Created attachment 909996 [details]
VDSM 5
It looks like a nicely overloaded system. Can we get a sar file? Or at least a load value? It may be just that we're simply over the limit the host can handle...

I'd like to clarify the test conditions. No load was generated during the test; we just created x number of VMs using the REST API and JMeter.

Hardware profile:
HOST - 64 CPUs and 400 GB RAM
VMs - 4 CPUs and 1 GB RAM (installed kernel and guest agent)

I'm asking about the system load indicator, and a System Activity Report, to separate the idle load of the environment from what RHEV causes by querying. It would be great to get a comparison between running vdsm and not running vdsm (with the VMs still running).

Looks like I found the number of VMs which run without problems (i.e., without VMs flapping between Up and Not Responding) in the current hardware and software configuration with the patch applied: it is 215 VMs.

This bug is also reproduced in a small setup (532 hosts, 19 VMs per host). Any chance the problem is related to the number of hosts?
engine version: 3.4.0-0.21.el6ev
vdsm version: vdsm-4.14.7-3.el6ev.x86_64

Probably not; the problem was the load of one host. Each VM produces threads to collect stats, and this really slowed down vdsm - the response to getAllVmStats took 6 minutes. Sharing logs might help to understand what happened in your environment.

Created attachment 911139 [details]
small setup log
Created attachment 911141 [details]
small setup log vdsm
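The "each VM produces threads to collect stats" point above is the crux; here is a schematic sketch of that per-VM sampling pattern. It is illustrative only - class and function names, the interval, and sample_vm() are invented, not vdsm's actual code:

```python
# Schematic sketch of a per-VM stats-collection pattern (one sampling
# thread per VM); all names here are invented for illustration.
import threading
import time

def sample_vm(vm_id):
    pass  # placeholder: in vdsm this would query libvirt for VM stats

class VmStatsCollector(threading.Thread):
    def __init__(self, vm_id, interval=15):
        threading.Thread.__init__(self)
        self.vm_id = vm_id
        self.interval = interval
        self.daemon = True

    def run(self):
        while True:
            sample_vm(self.vm_id)
            time.sleep(self.interval)

# With 200 VMs this becomes hundreds of threads contending for the GIL,
# consistent with the slow getAllVmStats responses reported in this bug.
for vm_id in range(200):
    VmStatsCollector(vm_id).start()
```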
Yuri, AFAIU we are running 215 VMs on a 32-core hypervisor (is it 32, or 16 with hyperthreading)?

Yaniv, other areas to analyze:
- profiling vdsm again for bottlenecks (AFAIU the XML parsing from libvirt is an issue)
- investigating the memory consumption of vdsm

(In reply to Barak from comment #47)
> AFAIU we are running 215 VMs on a 32 core hypervisor

I see in comment 39 64 cores (same question about hyperthreading?).

Eldad, I checked the logs and can still see the engine timing out when running GetAllVmStatsVDSCommand. There is one more issue in the engine log during ActivateVdsCommand: "Transaction is not active". I checked the vdsm log for any occurrences of getAllVmStats but was not able to find any. Can you provide a vdsm log where we can see the manifestation of the timeout issue?

(In reply to Barak from comment #48)
> (AFAIU the xml parsing from libvirt is an issue)

There are http://gerrit.ovirt.org/#/c/17694/ and http://gerrit.ovirt.org/#/c/26856/ which should help. Any response to comment #40? @eldad, comment #43 - so you had 10000 VMs?

(In reply to Michal Skrivanek from comment #40)
> I'm asking about system load indicator. And System Activity report to
> separate the idle load of the environment vs. what RHEV causes by querying.
> Would be great to get a comparison when running vdsm vs not running vdsm
> (with VMs still running)

We don't have the environment right now. What kind of load indicator? Please specify. What is a System Activity Report? What do you mean?

(In reply to Barak from comment #48)
> Other areas to analyze:
> - profiling vdsm again for bottlenecks
> - investigate the memory consumption of vdsm

We've looked into the virtual memory map of vdsm on Yuri's host and found a lot of maps with 0 RSS and no access permissions, all in the range of 50-64 megabytes. These maps accounted for about 23 GB of virtual memory. After close examination, all of those maps are heaps that malloc manages. Here is a short overview of how it becomes so big:

* Heaps are allocated on demand: when a thread tries to allocate, it tries to lock an existing heap, and if no heap is available, it creates a new one.
* There is a limit on the maximum number of heaps a process will have; it depends on the number of *cores* on the machine. The maximum is calculated this way (on a 64-bit machine): number of heaps <= (number of cores) * 8.
* When a new heap is created, a 64-megabyte mapping is created with no access permissions; then, whenever the process needs writable memory, it mprotect()s a piece of the heap (splitting the mapping in two) and uses the writable chunk.
* On a 64-core machine this can grow to 32 GB across 512 heaps.

The other major contribution to virtual memory is thread stacks, at 8 megabytes per stack; having 300 threads means a 2.4 GB bump in virtual memory.
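A quick check of the arithmetic in the analysis above, plus the standard glibc knob for capping the arena count; the numbers are the ones reported in this thread, not new measurements:

```python
# Back-of-the-envelope check of the virtual-memory analysis above.
cores = 64
arena_mb = 64                          # per-heap mapping created by glibc malloc
max_heaps = cores * 8                  # 64-bit glibc limit: 8 arenas per core
print(max_heaps)                       # 512 heaps
print(max_heaps * arena_mb / 1024.0)   # 32.0 GB of virtual address space

threads = 300
stack_mb = 8                           # default pthread stack size
print(threads * stack_mb / 1024.0)     # ~2.3 GB more from thread stacks

# glibc exposes the MALLOC_ARENA_MAX environment variable to cap the
# arena count, if the virtual-size inflation itself becomes a problem.
```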
I'd like to find out which thread(s) actually consume so much RSS that it all adds up to 900 MB - well, probably not that significant anyway. Yuri, would you please let me know when you have the setup again, just so I can take a look at the host?

Martin, I think I have reproduced the problem on vdsm-4.14.11-1.el6ev.x86_64.

*** Bug 1132165 has been marked as a duplicate of this bug. ***

Bug verification failed.
RHEVM - 3.5.0-0.17.beta.el6ev
VDSM - vdsm-4.16.7.1-1.el6ev

I ran a simple script which performs the getAllVmStats command and measures the response time. See the results:

getAllVmStats response time w/o patch
min 37.319 sec
average 46.398 sec
max 52.005 sec

getAllVmStats response time with patch
min 33.058 sec
average 48.268 sec
max 56.701 sec

I suggest opening a new bug. The fix here refers to virt's polling of the VM status and decreases its penalty - it was merged to 3.5, as the bug is targeted. Although performance issues with running VMs at large scale can lead to many bottlenecks, this particular bottleneck is solved in this bugzilla. (fromani - please change the bug title according to the fix.)

Yuri, 5 seconds better is also something meaningful, but we need to understand more deeply what takes so long and where to investigate. Please open a new bug with the latest logs and we'll try to investigate further, and close this one.

(In reply to Yaniv Bronhaim from comment #59)
> (fromani - please change the bug title according to the fix)

Done

(In reply to Yuri Obshansky from comment #58)
> Bug verification failed. [...]

Please specify which patch you are talking about. Also, please perform the test with both jsonrpc and xmlrpc and report the results for each.

Nir - I added you in order to check the storage-related issue; all the info is in this bug. Yuri has trouble reaching 200 VMs on his system and encountered a DC/SP issue (see comments 61 & 62); the needinfo on you is to check what the problem is.

Yuri, just to be on the safe side, you mentioned:
1 - comment #39 - using the REST API and JMeter
2 - Host status is not stable (bug description)

While in comment #58, where you failed verification, you wrote "I ran simple script which perform command getAllVmStats". Did the host change status? Did you still see "synchronous VM balloon query causes VDSNetworkException and TimeoutException"? It looks like you failed verification due to not improving the getAllVmStats response enough? Please be careful and do not clear other needinfos ;-)
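The measurement script from comment #58 above isn't attached; a sketch of what such a timing loop could look like follows. Only the vdsClient invocation is real vdsm CLI usage; the harness structure, iteration count, and host address are assumptions:

```python
# Hypothetical timing harness around "vdsClient getAllVmStats".
import subprocess
import time

HOST = "localhost"   # assumed host address

times = []
with open("/dev/null", "w") as devnull:
    for _ in range(10):
        start = time.time()
        subprocess.check_call(["vdsClient", "-s", HOST, "getAllVmStats"],
                              stdout=devnull)
        times.append(time.time() - start)

print("min %.3f avg %.3f max %.3f sec"
      % (min(times), sum(times) / len(times), max(times)))
```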
I ran a script with getAllVmStats to test Francesco's patch, which is the fix for bug 1110835 (a clone of the current bug). Anyway, I reran the test using the original flow with only one difference - I repeated the flow (create 200 VMs) twice:
1. host created without the jsonrpc protocol (using old xmlrpc)
2. host created with the jsonrpc protocol

Executive summary:
- The test with the jsonrpc protocol failed.
- The VM creation process became slower and slower after we'd created more than 80 VMs.
- The DC and Storage Domain crashed every 1/2 hour of running.
- Creation of the last 111 VMs took 7 minutes.
- The environment was not functional, from my point of view.

Exceptions raised in engine.log:

2014-11-16 14:22:45,120 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-15) [69a1a5c] Correlation ID: 69a1a5c, Job ID: b1d0760c-c346-4037-84c5-30bc8b73c58e, Call Stack: null, Custom Event ID: -1, Message: Failed to Reconstruct Master Domain for Data Center DC-REAL.
2014-11-16 14:25:42,260 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-60) [409d1fa0] ERROR, org.ovirt.engine.core.vdsbroker.irsbroker.GetStoragePoolInfoVDSCommand, exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues', log id: 7d4be17c: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues'
2014-11-16 14:25:42,275 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (DefaultQuartzScheduler_Worker-60) [409d1fa0] IRS failover failed - cant allocate vds server
2014-11-16 14:25:42,520 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (ajp-/127.0.0.1:8702-7) Operation Failed: [Cannot run VM. The relevant Storage Domain's status is Inactive.]

Errors raised in vdsm.log:

GuestMonitor-scale-107::DEBUG::2014-11-16 14:27:08,414::vm::486::vm.Vm::(_getUserCpuTuneInfo) vmId=`535fb4ef-a64b-4754-b700-87a8e32d31b5`::Domain Metadata is not set
Thread-140622::DEBUG::2014-11-16 14:27:08,419::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 80 edom: 20 level: 2 message: metadata not found: Requested metadata element is not present

Host details:
vdsm-4.16.7.1-1.el6ev
libvirt-0.10.2-46.el6_6.1
32 CPU cores
400 GB RAM

None of the above ERRORS has anything to do with json-rpc. The only thing that may be relevant is:

"""
org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-60) [409d1fa0] ERROR, org.ovirt.engine.core.vdsbroker.irsbroker.GetStoragePoolInfoVDSCommand, exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues', log id: 7d4be17c
"""

Yuri - please also re-run the same test with XML-RPC (same versions and hosts), just to be on the safe side (it still looks like a storage-related issue).

Piotr - please take a look at the timeout above.

Additional info: VM template details: 2 CPUs, 1 GB RAM

Already done:
Test 1 (xmlrpc) results: successfully completed
Test 2 (jsonrpc) results: test failed

Created attachment 958176 [details]
rhevm ui screenshot
Created attachment 958177 [details]
rhevm ui jsonrpc screenshot
Created attachment 958178 [details]
vdsm log jsonrpc
Created attachment 958179 [details]
engine log jsonrpc
Created attachment 958180 [details]
vdsm log xmlrpc
Created attachment 958181 [details]
engine log xmlrpc
I checked the jsonrpc logs and I can see that during the test we reconstructed the master domain:

2014-11-16 14:22:45,129 INFO [org.ovirt.engine.core.bll.eventqueue.EventQueueMonitor] (org.ovirt.thread.pool-7-thread-36) [1b84f6e9] Finished reconstruct for pool 0bc21fb5-d17d-45b7-8225-6ecefabca3f2. Clearing event queue

There was no interaction between the engine and vdsm for a long time (excluding heartbeats). The next log entry is:

2014-11-16 14:25:00,012 INFO [org.ovirt.engine.core.bll.AutoRecoveryManager] (DefaultQuartzScheduler_Worker-15) [1ad59ae9] Autorecovering 1 storage domains

I am not sure what happened yet, but from the vdsm log I can see that the last message it attempted to send was a response to a request that timed out:

Thread-161511::DEBUG::2014-11-16 14:22:42,609::__init__::498::jsonrpc.JsonRpcServer::(_serveRequest) Return 'StoragePool.getInfo' in bridge with {'info': {'name': 'No Description', 'isoprefix': '', 'pool_status': 'connected', 'lver': -1, 'domains': u'56b6b93d-f3f5-4075-98ae-85ce4a2d7b72:Active', 'master_uuid': u'56b6b93d-f3f5-4075-98ae-85ce4a2d7b72', 'version': '3', 'spm_id': -1, 'type': 'NFS', 'master_ver': 1}, 'dominfo': {u'56b6b93d-f3f5-4075-98ae-85ce4a2d7b72': {'status': u'Active', 'diskfree': '775436566528', 'isoprefix': '', 'alerts': [], 'disktotal': '805306368000', 'version': 3}}}

I am not sure whether this is the response that we waited for. I would like to understand what arrived at the other side. Can you please provide engine.log at debug level?

I've already uploaded engine.log -> engine log jsonrpc (123.78 KB, application/x-gzip). Is it not good?

Yuri, the log uses the info log level. I would need the debug log level to understand what is received by the engine.

Created attachment 958229 [details]
debug engine log
I've changed the log level up to DEBUG and uploaded a new engine.log. It contains these actions:
- activate the Storage Domain, because it was Inactive
- start 2 VMs
- the Storage Domain crashes and changes state to Inactive: "2014-Nov-17, 14:34 Failed to Reconstruct Master Domain for Data Center DC-REAL."

None of the messages timed out after setting the log level to DEBUG. I can see the issue in the logs; the following actions take place:
- SpmStopVDSCommand
- ResetIrsVDSCommand (closes the connection engine <-> vdsm (nullifyInternalProxies))
- ReconstructMasterDomainCommand (fails)

We close the connection and reconnect later (2014-11-17 14:42), so we have 8 minutes without an irs connection. Is that intended?

I got additional logs from Yuri and noticed that irs-related commands timed out due to spmStop ending with a disconnection. This is expected behavior during master reconstruction. I checked the master reconstruction flow and noticed that the StoragePool.reconstructMaster verb is not run at all. Here is what I could see in the logs on both ends:

2014-11-18 10:39:51,842 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.ResetIrsVDSCommand] (org.ovirt.thread.pool-7-thread-42) [68045ae6] FINISH, ResetIrsVDSCommand, log id: 29a77c5a
2014-11-18 10:39:51,850 DEBUG [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (org.ovirt.thread.pool-7-thread-42) [68045ae6] Command [id=791a6240-3038-40aa-9988-f382bb7040e9]: No compensation data.

Thread-307701::DEBUG::2014-11-18 10:39:52,256::__init__::467::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'StoragePool.spmStart' in bridge with {u'maxHostID': 250, u'storagepoolID': u'0bc21fb5-d17d-45b7-8225-6ecefabca3f2', u'prevLver': u'-1', u'enableScsiFencing': u'false', u'prevID': -1}
Thread-307701::DEBUG::2014-11-18 10:39:53,005::__init__::498::jsonrpc.JsonRpcServer::(_serveRequest) Return 'StoragePool.spmStart' in bridge with 2859a77a-6b46-405b-9fac-3baa4ed6a50f
Thread-307708::DEBUG::2014-11-18 10:39:55,438::__init__::467::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'StoragePool.connectStorageServer' in bridge with {u'connectionParams': [{u'id': u'7027805c-7dd8-4c7d-a266-e645d927c103', u'connection': u'netapp.qa.lab.tlv.redhat.com:/vol/vol_rhev_stress/sd_1', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'password': u'', u'port': u''}], u'storagepoolID': u'00000000-0000-0000-0000-000000000000', u'domainType': 1}
Thread-307708::DEBUG::2014-11-18 10:40:17,825::__init__::498::jsonrpc.JsonRpcServer::(_serveRequest) Return 'StoragePool.connectStorageServer' in bridge with [{'status': 0, 'id': u'7027805c-7dd8-4c7d-a266-e645d927c103'}]

Nir, can you help to understand what is failing during reconstruct? I do not see any issue related to communication.

Created attachment 958609 [details]
Latest engine and vdsm logs with debug
I think Liron can help with the problem mentioned in comment #83.

This bug is becoming too complex to follow. I don't think the issues found in the ReconstructMasterDomainCommand flow with json-rpc, although of course important, belong in this bug. I suggest opening a new one to better track them. I also believe it is best to file a new one addressing the performance characteristics of this patch, as concerns were raised in https://bugzilla.redhat.com/show_bug.cgi?id=1102147#c58. This bug was about avoiding timeouts and exceptions, and verification *did not* prove it faulty (although of course more verification is welcome). Barak, Michal, what do you think about the above?

Please let's continue here https://bugzilla.redhat.com/show_bug.cgi?id=1167229 for the performance concerns as raised in https://bugzilla.redhat.com/show_bug.cgi?id=1102147#c58. Changing the title to something less misleading. Moving to 3.5.1 to not block GA.

Reconstruct wasn't executed here because there was one domain in the pool which all the hosts had problems accessing - see the following repoStats result:

Thread-307378::INFO::2014-11-18 10:33:13,394::logUtils::47::dispatcher::(wrapper) Run and protect: repoStats, Return response: {u'56b6b93d-f3f5-4075-98ae-85ce4a2d7b72': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000537113', 'lastCheck': '74.3', 'valid': True}}

By default, a report of lastCheck > 30 (that's a config value) marks a domain as problematic. If after 5 minutes the domain isn't reported as non-problematic by the hosts in the data center, it is moved to Inactive status. In this case, I can see that the repoStats calls in the vdsm logs come in huge gaps (10-12 minutes) - that's related to the number of running VMs (I assume) - and is related to bug https://bugzilla.redhat.com/show_bug.cgi?id=1099068, as currently the VM monitoring and host monitoring aren't separated. What needs to be done is a separation of the monitoring; regardless, the config value (of the lastCheck limit) can be increased under that heavy load.

As detailed above in comment #89, it looks like the fact that it succeeded with XML-RPC and failed with JSON-RPC is a pure matter of luck, unless we return a wrong "lastCheck" number from vdsm on json-rpc (long shot). Looking at the code I see:

[bazulay@bazulay vdsm (master)]$ git grep -i lastcheck
vdsm/rpc/vdsmapi-schema.json:# @lastCheck: Elapsed time since the Storage Domain was last checked
vdsm/rpc/vdsmapi-schema.json: 'data': {'code': 'int', 'delay': 'float', 'lastCheck': 'float',
vdsm/storage/hsm.py: lastcheck = '%.1f' % (statsGenTime - domStatus.checkTime)
vdsm/storage/hsm.py: 'lastCheck': lastcheck,

Piotr, can you please check that the numbers are reported the same on XML-RPC & JSON-RPC?

Here are the values that I was able to find in the log:
- xmlrpc - 6.9, 1.3, 1.4, 6.7, 0.7, 9.0, 12.0
- jsonrpc - 68.9, 44.7, 24.6, 68.9, 44.7, 24.6

I mean in the same env: while running the engine with json-rpc, use vdsClient to get it over xml-rpc and compare. Are these the numbers in comment #91?

I am not sure what you are looking for, but when running in the same env I can see that the numbers are similar in terms of order of magnitude, like 2.7 and 0.4. We can see in bug #1168540 that with xmlrpc master reconstruction fails every hour.
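The hsm.py line found by the grep above is the whole lastCheck computation. Here is a sketch of the threshold logic being debated, using the 30-second limit and 5-minute grace period cited in comment #89; the function structure is illustrative, not the engine's actual code:

```python
# Illustrative sketch of the lastCheck threshold logic discussed above;
# thresholds are the cited config values, the structure is an assumption.
import time

LAST_CHECK_LIMIT = 30.0     # seconds; above this the domain is "problematic"
GRACE_PERIOD = 5 * 60.0     # problematic this long => moved to Inactive

def last_check(stats_gen_time, dom_check_time):
    # Mirrors vdsm/storage/hsm.py:
    #   lastcheck = '%.1f' % (statsGenTime - domStatus.checkTime)
    return '%.1f' % (stats_gen_time - dom_check_time)

def domain_state(last_check_value, problematic_since, now=None):
    now = time.time() if now is None else now
    if float(last_check_value) <= LAST_CHECK_LIMIT:
        return "ok"
    if problematic_since is not None and now - problematic_since >= GRACE_PERIOD:
        return "inactive"   # what triggered the master reconstruction here
    return "problematic"
```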
(In reply to Piotr Kliczewski from comment #93)
> when running in the same env I can see that the numbers are similar in
> terms of order of magnitude. Like 2.7 and 0.4.

A float might be reported slightly differently, and with higher numbers it might be the cause of the reported delay only through json-rpc (the default configuration in the engine is that when it passes 30 seconds, you enter this flow).

There is no difference in how the response is passed to the marshalling code. As stated in comment #94, there is the same issue using xmlrpc. I think that we are looking in the wrong direction. We marshal many float values, and we would have noticed the difference. As I understand it, we generate load and the lastCheck values get higher. We can see that the values are higher than 30, so after 5 minutes we report this domain as problematic, which triggers master reconstruction.

Seems to me that there is no real difference between the protocols; it's just a matter of luck - when you cross 30s, it gets triggered. I would close the bug if there is no objection. Also, see comment #89 for the possible explanation of the problems. Once reporting is delayed that much, anything happens and falls apart.

I cannot reproduce it 100% due to the storage issues. Looks like it is not a bug, so it will be closed.
Created attachment 900022 [details]
engine.log

Description of problem:
Host state is not stable after population of 200 VMs. Every 4-5 minutes the host switches to Connecting state.

2014-05-28 14:01:55,464 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler_Worker-32) Command GetAllVmStatsVDSCommand(HostName = HOST-REAL, HostId = ee38bc5a-0ad7-4034-80a6-a2af84eac5dc, vds=Host[HOST-REAL,ee38bc5a-0ad7-4034-80a6-a2af84eac5dc]) execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException
2014-05-28 14:01:55,542 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-32) Failed to refresh VDS , vds = ee38bc5a-0ad7-4034-80a6-a2af84eac5dc : HOST-REAL, VDS Network Error, continuing.java.util.concurrent.TimeoutException

occurs in engine.log, and

Thread-886::ERROR::2014-05-28 13:22:28,627::SecureXMLRPCServer::168::root::(handle_error) client ('10.35.3.13', 59944)
Traceback (most recent call last):
  File "/usr/lib64/python2.6/SocketServer.py", line 570, in process_request_thread
  File "/usr/lib64/python2.6/SocketServer.py", line 332, in finish_request
  File "/usr/lib64/python2.6/SocketServer.py", line 627, in __init__
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 331, in handle
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 323, in handle_one_request
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 223, in do_POST
  File "/usr/lib64/python2.6/socket.py", line 324, in write
  File "/usr/lib64/python2.6/socket.py", line 303, in flush
  File "/usr/lib64/python2.6/site-packages/M2Crypto/SSL/Connection.py", line 222, in write
  File "/usr/lib64/python2.6/site-packages/M2Crypto/SSL/Connection.py", line 205, in _write_bio
SSLError: bad write retry

occurs in vdsm.log.

Version-Release number of selected component (if applicable):
RHEVM: 3.4.0-0.16.rc.el6ev
OS Version: RHEV Hypervisor - 6.5 - 20140313.1.el6ev
Kernel Version: 2.6.32 - 431.5.1.el6.x86_64
KVM Version: 0.12.1.2 - 2.415.el6_5.6
LIBVIRT Version: libvirt-0.10.2-29.el6_5.5
VDSM Version: vdsm-4.14.2-0.4.el6ev

How reproducible:

Steps to Reproduce:
1. Create 1 DC, 1 Cluster, 1 Host
2. Connect to an iSCSI SD
3. Create 200 VMs

Actual results:
See above.

Expected results:
Host is stable.

Additional info:
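The traceback above ends in M2Crypto's _write_bio with "bad write retry". That OpenSSL error fires when a short or blocked SSL write is retried with a different buffer, which matches Piotr's diagnosis earlier in the thread (new messages being interleaved while the huge response is still pending). A minimal sketch of the correct retry pattern, using Python's standard ssl module rather than M2Crypto (illustrative, not vdsm code):

```python
# Illustrative sketch (standard ssl module, not M2Crypto / vdsm code):
# after a non-blocking TLS write raises WANT_WRITE, OpenSSL requires the
# retry to pass the very same buffer; submitting different bytes instead
# is what produces "SSL3_WRITE_PENDING: bad write retry".
import select
import ssl

def send_all(tls_sock, data):
    """Send data over a non-blocking SSLSocket, retrying correctly."""
    view = memoryview(data)
    while len(view):
        try:
            sent = tls_sock.send(view)
            view = view[sent:]
        except ssl.SSLWantWriteError:
            # Wait until writable, then retry with the SAME buffer.
            select.select([], [tls_sock], [])
```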