Bug 1169100
| Field | Value |
|---|---|
| Summary | UpdateVM sometimes fails using JsonRPC (request doesn't reach vdsm). |
| Product | Red Hat Enterprise Virtualization Manager |
| Reporter | Liron Aravot <laravot> |
| Component | ovirt-engine |
| Assignee | Piotr Kliczewski <pkliczew> |
| Status | CLOSED CURRENTRELEASE |
| QA Contact | Petr Matyáš <pmatyas> |
| Severity | urgent |
| Priority | high |
| Version | 3.5.0 |
| CC | aberezin, acanan, amureini, cmestreg, dron, ebenahar, ecohen, eedri, fsimonce, gklein, iheim, laravot, lkuchlan, lpeer, lsurette, mlipchuk, oourfali, pstehlik, ratamir, rbalakri, scohen, smizrahi, srevivo, tnisan, yeylon |
| Target Milestone | --- |
| Keywords | Regression, Reopened |
| Target Release | 3.5.0 |
| Hardware | x86_64 |
| OS | Linux |
| Whiteboard | infra |
| Fixed In Version | org.ovirt.engine-root-3.5.0-25 |
| Doc Type | Bug Fix |
| Story Points | --- |
| Clone Of | 966103 |
| Last Closed | 2015-02-17 17:12:27 UTC |
| Type | Bug |
| Regression | --- |
| oVirt Team | Infra |
| Cloudforms Team | --- |
| Bug Blocks | 966103, 1164308, 1164311 |
Description (Liron Aravot, 2014-11-30 10:07:30 UTC)
lkuchlan, let's try to reproduce the issue using XML-RPC, just to be on the safe side, to verify that it occurs only when using JSON-RPC. Thanks. This bug was opened so as not to block performing RCA on it, as the issue seems related to using JSON-RPC (it never occurred before/on other versions).

Piotr - If this doesn't reproduce in XMLRPC, then please look at it in depth. If it does reproduce in XMLRPC, then it is a storage issue.

Forgot to mention: other calls succeeded during that time.

```
2014-11-30 10:47:59,024 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (org.ovirt.thread.pool-7-thread-9) [7006144] START, GetImageInfoVDSCommand( storagePoolId = 00000002-0002-0002-0002-0000000000b8, ignoreFailoverLimit = false, storageDomainId = 2b6bdbe2-bf61-4688-ad2f-2dcbe893ed64, imageGroupId = 1f2636a8-f1e3-4404-bd1a-db2ae77584f2, imageId = c76407ea-f19f-4896-b428-8007d0fa0b1f), log id: c46cca
2014-11-30 10:47:59,265 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (org.ovirt.thread.pool-7-thread-9) [7006144] FINISH, GetImageInfoVDSCommand, return: org.ovirt.engine.core.common.businessentities.DiskImage@6b68ce48, log id: c46cca
2014-11-30 10:48:00,409 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.UpdateVMVDSCommand] (org.ovirt.thread.pool-7-thread-9) START, UpdateVMVDSCommand( storagePoolId = 00000002-0002-0002-0002-0000000000b8, ignoreFailoverLimit = false, storageDomainId = bb5d6fa6-20f5-4746-8354-7fce9b52e003, infoDictionary.size = 1), log id: 6bfb322f
2014-11-30 10:51:00,918 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.UpdateVMVDSCommand] (org.ovirt.thread.pool-7-thread-9) Command UpdateVMVDSCommand( storagePoolId = 00000002-0002-0002-0002-0000000000b8, ignoreFailoverLimit = false, storageDomainId = bb5d6fa6-20f5-4746-8354-7fce9b52e003, infoDictionary.size = 1) execution failed.
Exception: IrsOperationFailedNoFailoverException: IRSGenericException: IRSErrorException: Message timeout which can be caused by communication issues
```

(In reply to Oved Ourfali from comment #2)
> Piotr - If this doesn't reproduce in XMLRPC, then please look at it in depth.
> If it will reproduce in XMLRPC then it is a storage issue.

Agreed. If the initial analysis proves incorrect, please move it back.

Regardless of where the issue is (storage or infra), I'm flagging this bug as a potential blocker. If this is true, it's a regression in 3.5.0.

I checked the logs available in bug #966103 and noticed that StoragePool.updateVMs does not seem to be invoked on the vdsm side. In the engine logs I can see that the engine waits for 3 minutes without any activity and then times out. I am not able to tell whether the data was not sent or not received. Can you please set the engine log level to debug, run the test one more time, and attach the logs.

Created attachment 963191 [details]
logs
The attached logs are the relevant original logs (not at debug level).

lkuchlan, can you provide Piotr with the needed logs at debug level? Thanks.

Created attachment 963582 [details]
logs and images
Liron, please find attached the logs reproduced using XML-RPC and JSON-RPC.
With XML-RPC it works, but with JSON-RPC, after exporting a VM based on a template, the VM stays blocked with a status showing the image is locked.
Thank you for providing the logs, but you used info log level. Can you please set the log level to debug? I am not sure whether the data was not sent or not received.

I tried to follow the steps to reproduce the issue; here are the results. I built the latest 3.5 of the engine and vdsm and ran both on f20, using jsonrpc.

Steps to Reproduce:
1. Create a VM based on a template => OK (based on CirrOS imported as a template)
2. Create a snapshot => OK
3. Export the VM without the template => OK
4. Try to import the VM => I saw the import conflict dialog, changed the name of the VM, and got: "Cannot import and collapse VM, Template is missing in domain export"
5. Export the template => When I clicked OK I saw: "Cannot export Template. The Storage Domain already contains the target disk(s)."

After performing the above operations I grepped the logs and saw that UpdateVMVDSCommand was successful on the engine side, and on the vdsm side I saw:

```
Thread-634::DEBUG::2014-12-03 11:37:36,504::__init__::500::jsonrpc.JsonRpcServer::(_serveRequest) Return 'StoragePool.updateVMs' in bridge with True
```

Can you please suggest how to reproduce it?

Please retest with the new build, which should be available today.

Created attachment 964147 [details]
logs as debug level
Tested on RHEVM 3.5 vt12
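For context on the transport involved here: each engine request such as StoragePool.updateVMs travels to vdsm as a JSON-RPC 2.0 message wrapped in a STOMP SEND frame carrying a content-length header. A minimal sketch of building such a frame; the destination queue matches what appears in the vdsm logs in this thread, while the request body shown is abbreviated and hypothetical, not a real engine payload:

```python
import json

def build_stomp_send(destination, body_bytes):
    """Build a STOMP SEND frame: command, headers, blank line, body, NUL."""
    head = (
        "SEND\n"
        "destination:%s\n"
        "content-length:%d\n"
        "\n" % (destination, len(body_bytes))
    )
    return head.encode("utf-8") + body_bytes + b"\x00"

# Hypothetical, heavily abbreviated StoragePool.updateVMs request body.
request = {
    "jsonrpc": "2.0",
    "method": "StoragePool.updateVMs",
    "params": {"storagepoolID": "00000002-0002-0002-0002-0000000002e1"},
    "id": 1,
}
frame = build_stomp_send("/queue/_local/vdsm/requests",
                         json.dumps(request).encode("utf-8"))
```

The content-length header is what lets the receiver know how many body bytes to expect before the terminating NUL, which becomes relevant in the framing analysis later in this thread.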
Thank you for the new logs. It looks like the data was sent to the other side. In order to confirm, can you please provide a wireshark cap file?

I performed the same steps as in comment #11, with the difference that I installed the host from the qa-3.50-latest repo on rhel66. I had exactly the same results as in that comment.

Created attachment 964727 [details]
logs and cap files
Reproduced:

```
2014-12-04 17:19:26,465 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.UpdateVMVDSCommand] (org.ovirt.thread.pool-7-thread-49) START, UpdateVMVDSCommand( storagePoolId = 00000002-0002-0002-0002-0000000002e1, ignoreFailoverLimit = false, storageDomainId = 05671944-6611-4331-ac4b-6330d1293132, infoDictionary.size = 1), log id: 5765f455
2014-12-04 17:20:05,068 INFO [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (DefaultQuartzScheduler_Worker-72) Setting new tasks map. The map contains now 1 tasks
2014-12-04 17:20:07,390 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-81) VM vm-7 bc3a1c58-3bd9-4f40-ae2e-648803bb6c1e moved from PoweringUp --> Up
2014-12-04 17:20:07,405 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-81) Correlation ID: 47dc1641, Job ID: f9f49b55-b659-4303-8a83-ffac7d514e16, Call Stack: null, Custom Event ID: -1, Message: VM vm-7 started on Host camel-vdsb
2014-12-04 17:22:26,833 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.UpdateVMVDSCommand] (org.ovirt.thread.pool-7-thread-49) Command UpdateVMVDSCommand( storagePoolId = 00000002-0002-0002-0002-0000000002e1, ignoreFailoverLimit = false, storageDomainId = 05671944-6611-4331-ac4b-6330d1293132, infoDictionary.size = 1) execution failed. Exception: IrsOperationFailedNoFailoverException: IRSGenericException: IRSErrorException: Message timeout which can be caused by communication issues
2014-12-04 17:22:26,833 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.UpdateVMVDSCommand] (org.ovirt.thread.pool-7-thread-49) FINISH, UpdateVMVDSCommand, log id: 5765f455
2014-12-04 17:22:26,850 INFO [org.ovirt.engine.core.utils.transaction.TransactionSupport] (org.ovirt.thread.pool-7-thread-49) transaction rolled back
2014-12-04 17:22:26,860 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-49) Correlation ID: 5efdd286, Call Stack: null, Custom Event ID: -1, Message: Failed to export Vm vm-6 to export
2014-12-04 17:22:26,860 ERROR [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-7-thread-49) [within thread]: endAction for action type ExportVm threw an exception.
2014-12-04 17:22:26,860 ERROR [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-7-thread-49) org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.irsbroker.IrsOperationFailedNoFailoverException: IRSGenericException: IRSErrorException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022)
2014-12-04 17:22:26,860 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-7-thread-49) CommandAsyncTask::HandleEndActionResult [within thread]: endAction for action type ExportVm completed, handling the result.
```
Performed scenario #2 up to the export phase:
create a VM from a template and run the VM -> LSM the VM -> shut down the VM -> export the VM
logs and cap files attached
Created attachment 964785 [details]
screenshot
Please look at the attached screenshot.
I'm not familiar with such behaviour.
Which version of wireshark are you using? I am using 1.10.11 and I am not able to open the files.

I missed the LSM step in my testing, so I will try to reproduce it one more time.

I checked the log files and tcpdumps, and it looks like the engine sends the message:

```
17:22:36.727627 IP vm-162-38.scl.lab.tlv.redhat.com.54429 > 10.35.116.2.54321: Flags [.], seq 2781:5677, ack 4415, win 205, options [nop,nop,TS val 1839070084 ecr 6681229], length 2896
17:22:36.727668 IP vm-162-38.scl.lab.tlv.redhat.com.54429 > 10.35.116.2.54321: Flags [.], seq 5677:8573, ack 4415, win 205, options [nop,nop,TS val 1839070084 ecr 6681229], length 2896
17:22:36.727676 IP vm-162-38.scl.lab.tlv.redhat.com.54429 > 10.35.116.2.54321: Flags [.], seq 8573:11469, ack 4415, win 205, options [nop,nop,TS val 1839070084 ecr 6681229], length 2896
17:22:36.727686 IP vm-162-38.scl.lab.tlv.redhat.com.54429 > 10.35.116.2.54321: Flags [.], seq 11469:14365, ack 4415, win 205, options [nop,nop,TS val 1839070084 ecr 6681229], length 2896
17:22:36.727694 IP vm-162-38.scl.lab.tlv.redhat.com.54429 > 10.35.116.2.54321: Flags [.], seq 14365:17261, ack 4415, win 205, options [nop,nop,TS val 1839070084 ecr 6681229], length 2896
17:22:36.728297 IP 10.35.116.2.54321 > vm-162-38.scl.lab.tlv.redhat.com.54429: Flags [.], ack 17261, win 249, options [nop,nop,TS val 6681405 ecr 1839070084], length 0
```

I ran the following steps: create a VM from a template and run the VM -> LSM the VM -> shut down the VM -> export the VM, using the engine on f20 and the host on f20. The test was successful.

(In reply to Piotr Kliczewski from comment #21)
> I run following steps:
>
> create a vm from template and run the vm -> LSM the vm -> shut down the vm
> -> export the vm
>
> using engine on f20 and host on f20. The test was successful.

I guess you're already on it, but in case you're not, I'd try rhel as well.

I attempted to reproduce it on lkuchlan's environment, but I was not able to perform LSM. When I performed an offline move there was no issue.
I started to analyze the logs provided by Elad and noticed that every time UpdateVMVDSCommand is run, there is the following line in vdsm.log:

```
Thread-1909::DEBUG::2014-12-04 17:22:39,441::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 80 edom: 20 level: 2 message: metadata not found: Requested metadata element is not present
```

I have seen before that libvirt was able to make vdsm unresponsive.

I started to play with the reproducer on Elad's env. I modified the recv method in betterAsyncore.py, and I am able to see the message on the vdsm side:

```
JsonRpc (StompReactor)::DEBUG::2014-12-05 15:41:15,104::betterAsyncore::289::root::(recv) SEND destination:/queue/_local/vdsm/requests content-length:23417 {"jsonrpc":"2.0","method":"StoragePool.updateVMs","params":{"sto
JsonRpc (StompReactor)::DEBUG::2014-12-05 15:41:15,107::betterAsyncore::289::root::(recv) :size=\"10
JsonRpc (StompReactor)::DEBUG::2014-12-05 15:41:15,110::betterAsyncore::289::root::(recv) None
```

The message appears to be incomplete and is never emitted for processing.

Created attachment 965251 [details]
cap files and logs (1)
Uploading the correct cap files, along with the logs from engine and vdsm.
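The incomplete-message symptom above points at frame reassembly: a single STOMP frame can arrive split across several recv() calls, so the parser has to buffer bytes until the full content-length body (plus the NUL terminator) is present before emitting the message. A minimal illustrative sketch of that buffering; this is a simplified stand-in, not vdsm's actual betterAsyncore/StompReactor code:

```python
class StompFrameBuffer:
    """Accumulate raw bytes and emit only complete STOMP frame bodies.

    A frame here is: command line, header lines, a blank line, then
    content-length bytes of body followed by a NUL byte. Simplified:
    the content-length header is required.
    """
    def __init__(self):
        self._buf = b""

    def feed(self, data):
        """Add newly received bytes; return a list of complete bodies."""
        self._buf += data
        frames = []
        while True:
            head_end = self._buf.find(b"\n\n")
            if head_end == -1:
                break  # headers not complete yet
            length = None
            for line in self._buf[:head_end].decode("utf-8").split("\n")[1:]:
                key, _, value = line.partition(":")
                if key == "content-length":
                    length = int(value)
            if length is None:
                break  # simplified sketch: content-length required
            body_start = head_end + 2
            frame_end = body_start + length + 1  # +1 for the NUL byte
            if len(self._buf) < frame_end:
                break  # body not fully received yet -- keep buffering
            frames.append(self._buf[body_start:body_start + length])
            self._buf = self._buf[frame_end:]
        return frames
```

A parser that instead treats each recv() chunk as a whole message would show exactly the behavior in the logs above: a truncated payload followed by nothing being handed to the JSON-RPC layer.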
```
2014-12-05 23:08:02,760 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.UpdateVMVDSCommand] (org.ovirt.thread.pool-7-thread-33) Command UpdateVMVDSCommand( storagePoolId = 00000002-0002-0002-0002-0000000002e1, ignoreFailoverLimit = false, storageDomainId = 05671944-6611-4331-ac4b-6330d1293132, infoDictionary.size = 1) execution failed. Exception: IrsOperationFailedNoFailoverException: IRSGenericException: IRSErrorException: Message timeout which can be caused by communication issues
```
Removing the need-info.

Created attachment 966159 [details]
new vdsm log
Created attachment 966161 [details]
new engine log
Still occurs - logs attached (new vdsm log and new engine log).

Reproduced on another host, 10.35.102.10 - Red Hat Enterprise Linux Server release 7.0.

Please post cap files of both sides, prefiltered to port 54321.

I tested on the suggested env; the code failed to run updateVMs, but it tried to recover twice and the second attempt was successful.

Created attachment 966759 [details]
engine log for export vm case
Run explained in the comment; the export action starts at 2014-12-09 23:45.
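The recovery behavior mentioned above (an updateVMs call timing out, then being retried until an attempt succeeds) can be sketched, purely illustratively, as a bounded retry loop; the names below are hypothetical and not taken from the actual ovirt-engine code:

```python
def call_with_retries(call, attempts=3):
    """Invoke call(); on timeout, retry up to `attempts` times in total.

    Illustrative stand-in for the engine's retry-on-timeout behavior
    described in this thread; not the real IRS proxy implementation.
    """
    last_err = None
    for _ in range(attempts):
        try:
            return call()
        except TimeoutError as err:
            last_err = err  # remember the failure and try again
    raise last_err  # all attempts timed out
```

Under this model, a first failed updateVMs followed by a successful second attempt is an expected, recoverable outcome rather than a hard error.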
I've seen this issue in our 3.5 jobs for read-only disk tests:
1. Create a VM
2. Add 6 read-only disks with different combinations (raw/cow, sparse/not, virtio/virtio-scsi)
3. Export the VM using the REST API

UpdateVMVDSCommand fails with communication issues as stated in previous comments.

Carlos, it is the same issue and we already have a fix for it.

Yes, I saw. I think it is the same issue, but I prefer to upload a comment with the logs just in case this bug goes back to NEW because the fix is not correct (thus having more info).

rhev 3.5.0 was released. Closing.