Bug 1169100 - UpdateVM sometimes fails using JsonRPC (request doesn't reach vdsm).
Summary: UpdateVM sometimes fails using JsonRPC (request doesn't reach vdsm).
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 3.5.0
Assignee: Piotr Kliczewski
QA Contact: Petr Matyáš
URL:
Whiteboard: infra
Depends On:
Blocks: 966103 rhev35rcblocker rhev35gablocker
 
Reported: 2014-11-30 10:07 UTC by Liron Aravot
Modified: 2016-02-10 19:14 UTC (History)
CC: 25 users

Fixed In Version: org.ovirt.engine-root-3.5.0-25
Doc Type: Bug Fix
Doc Text:
Clone Of: 966103
Environment:
Last Closed: 2015-02-17 17:12:27 UTC
oVirt Team: Infra


Attachments
logs (631.12 KB, application/zip)
2014-12-01 09:26 UTC, Liron Aravot
no flags
logs and images (4.15 MB, application/x-gzip)
2014-12-02 08:10 UTC, lkuchlan
no flags
logs as debug level (3.42 MB, application/x-gzip)
2014-12-03 14:36 UTC, lkuchlan
no flags
logs and cap files (15.40 MB, application/x-gzip)
2014-12-04 15:38 UTC, Elad
no flags
screenshot (65.77 KB, image/png)
2014-12-04 18:33 UTC, Elad
no flags
cap files and logs (1) (3.98 MB, application/x-gzip)
2014-12-05 22:06 UTC, Elad
no flags
new vdsm log (2.59 MB, text/plain)
2014-12-09 09:30 UTC, Raz Tamir
no flags
new engine log (862.43 KB, text/plain)
2014-12-09 09:32 UTC, Raz Tamir
no flags
engine log for export vm case (2.05 MB, text/plain)
2014-12-10 12:10 UTC, Carlos Mestre González
no flags


Links
System ID Priority Status Summary Last Updated
oVirt gerrit 35938 master MERGED stomp: Make sure the \0 is a last char in the buffer Never
oVirt gerrit 36041 master MERGED send: wait on OS buff to empty Never
oVirt gerrit 36044 ovirt-3.5 MERGED send: wait on OS buff to empty Never
oVirt gerrit 36046 ovirt-engine-3.5 MERGED jsonrpc: version bump Never

Description Liron Aravot 2014-11-30 10:07:30 UTC
Opening this bug based on the logs provided by lkuchlan. (https://bugzilla.redhat.com/show_bug.cgi?id=966103#c20).
It can be seen that the updateVMs verb fails many times on timeout, while the request isn't even logged in the vdsm log. It doesn't fail consistently: after many failures it eventually succeeds, so the problem seems to be a communication/transport issue. Although there are many calls to UpdateVM, the first one logged in the vdsm log appears at 10:51.
Steps to reproduce are attached above.
The issue seems to be related to the use of JSON-RPC (it was never encountered using XML-RPC); adding needinfo? on lkuchlan to try to reproduce with XML-RPC.

I'm not sure whether the problem is in how the engine sends the request or on the vdsm side; opening against the engine for now.

>> grep "UpdateVMV" engine.log
 ------------------------------------------------------------------
2014-11-30 10:26:08,900 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.UpdateVMVDSCommand] (org.ovirt.thread.pool-7-thread-31) FINISH, UpdateVMVDSCommand, log id: 5d1aed3
2014-11-30 10:26:18,701 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.UpdateVMVDSCommand] (org.ovirt.thread.pool-7-thread-22) START, UpdateVMVDSCommand( storagePoolId = 00000002-0002-0002-0002-0000000000b8, ignoreFailoverLimit = false, storageDomainId = bb5d6fa6-20f5-4746-8354-7fce9b52e003, infoDictionary.size = 1), log id: 414cfe06
2014-11-30 10:48:00,409 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.UpdateVMVDSCommand] (org.ovirt.thread.pool-7-thread-9) START, UpdateVMVDSCommand( storagePoolId = 00000002-0002-0002-0002-0000000000b8, ignoreFailoverLimit = false, storageDomainId = bb5d6fa6-20f5-4746-8354-7fce9b52e003, infoDictionary.size = 1), log id: 6bfb322f
2014-11-30 10:51:00,918 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.UpdateVMVDSCommand] (org.ovirt.thread.pool-7-thread-9) Command UpdateVMVDSCommand( storagePoolId = 00000002-0002-0002-0002-0000000000b8, ignoreFailoverLimit = false, storageDomainId = bb5d6fa6-20f5-4746-8354-7fce9b52e003, infoDictionary.size = 1) execution failed. Exception: IrsOperationFailedNoFailoverException: IRSGenericException: IRSErrorException: Message timeout which can be caused by communication issues
2014-11-30 10:51:00,966 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.UpdateVMVDSCommand] (org.ovirt.thread.pool-7-thread-9) FINISH, UpdateVMVDSCommand, log id: 6bfb322f
2014-11-30 10:51:09,309 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.UpdateVMVDSCommand] (org.ovirt.thread.pool-7-thread-49) START, UpdateVMVDSCommand( storagePoolId = 00000002-0002-0002-0002-0000000000b8, ignoreFailoverLimit = false, storageDomainId = bb5d6fa6-20f5-4746-8354-7fce9b52e003, infoDictionary.size = 1), log id: 1fa38320
2014-11-30 10:51:09,369 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.UpdateVMVDSCommand] (org.ovirt.thread.pool-7-thread-49) FINISH, UpdateVMVDSCommand, log id: 1fa38320
------------------------------------------------------------------

[laravot@dhcp-3-194 /]$ grep "updateVM" vdsm.log
Thread-28858::DEBUG::2014-11-30 10:51:14,066::__init__::467::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'StoragePool.updateVMs' in bridge with {u'storagep



+++ This bug was initially created as a clone of Bug #966103 +++

Description of problem:

We actually have two scenarios here which have user impact:

1. create a vm from template + add a snapshot -> export the vm without the template -> import the vm -> UI: One of the templates cannot be found in the system, VM(s) cannot be imported

2. create a vm from template and run the vm -> LSM the vm -> shut down the vm -> export the vm -> remove the vm -> import the vm: Some imported VMs depend on one or more templates which are not available in the system. Therefore you must Import those VMs with 'collapse snapshots', another option is to Import missing templates first and then try import the VMs again

 
Version-Release number of selected component (if applicable):

sf17.2

How reproducible:

100%

Steps to Reproduce:
1. create a vm based on template 
2. create a snapshot
3. export the vm without the template
4. try to import the vm
5. export the template
6. try to export the vm

Actual results:

1. We cannot import the vm if the template does not exist in the export domain, although it exists in the setup.
2. If the template exists in the export domain, we still do not detect it in the setup and force the user to collapse and import the vm as an independent image.

Expected results:

we should be able to detect the template and import the vm


--- Additional comment from lkuchlan on 2014-11-30 03:42:10 EST ---

Tested using RHEVM 3.5 vt11
After exporting a VM based on a template, the VM got stuck in the Image Locked status

Comment 1 Liron Aravot 2014-11-30 10:10:52 UTC
lkuchlan, let's try to reproduce the issue using XML-RPC to be sure that it occurs only when using JSON-RPC, just to be on the safe side. Thanks.

This bug was opened so as not to block performing RCA on it, as the issue seems to be related to using JSON-RPC (it never occurred before/on other versions).

Comment 2 Oved Ourfali 2014-11-30 10:23:43 UTC
Piotr - If this doesn't reproduce in XMLRPC, then please look at it in depth.
If it will reproduce in XMLRPC then it is a storage issue.

Comment 3 Liron Aravot 2014-11-30 10:25:24 UTC
forgot to mention, other calls succeeded during that time

2014-11-30 10:47:59,024 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (org.ovirt.thread.pool-7-thread-9) [7006144] START, GetImageInfoVDSCommand( storagePoolId = 00000002-0002-0002-0002-0000000000b8, ignoreFailoverLimit = false, storageDomainId = 2b6bdbe2-bf61-4688-ad2f-2dcbe893ed64, imageGroupId = 1f2636a8-f1e3-4404-bd1a-db2ae77584f2, imageId = c76407ea-f19f-4896-b428-8007d0fa0b1f), log id: c46cca
2014-11-30 10:47:59,265 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (org.ovirt.thread.pool-7-thread-9) [7006144] FINISH, GetImageInfoVDSCommand, return: org.ovirt.engine.core.common.businessentities.DiskImage@6b68ce48, log id: c46cca
2014-11-30 10:48:00,409 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.UpdateVMVDSCommand] (org.ovirt.thread.pool-7-thread-9) START, UpdateVMVDSCommand( storagePoolId = 00000002-0002-0002-0002-0000000000b8, ignoreFailoverLimit = false, storageDomainId = bb5d6fa6-20f5-4746-8354-7fce9b52e003, infoDictionary.size = 1), log id: 6bfb322f
2014-11-30 10:51:00,918 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.UpdateVMVDSCommand] (org.ovirt.thread.pool-7-thread-9) Command UpdateVMVDSCommand( storagePoolId = 00000002-0002-0002-0002-0000000000b8, ignoreFailoverLimit = false, storageDomainId = bb5d6fa6-20f5-4746-8354-7fce9b52e003, infoDictionary.size = 1) execution failed. Exception: IrsOperationFailedNoFailoverException: IRSGenericException: IRSErrorException: Message timeout which can be caused by communication issues

Comment 4 Allon Mureinik 2014-12-01 08:28:18 UTC
(In reply to Oved Ourfali from comment #2)
> Piotr - If this doesn't reproduce in XMLRPC, then please look at it in depth.
> If it will reproduce in XMLRPC then it is a storage issue.

Agreed.
If the initial analysis proves incorrect, please move it back.

Regardless of where the issue is (storage or infra), I'm flagging this bug as a potential blocker. If this is true, it's a regression in 3.5.0.

Comment 6 Piotr Kliczewski 2014-12-01 09:16:35 UTC
I checked the logs available in bug #966103 and I noticed that StoragePool.updateVMs seems not to be invoked on the vdsm side. In the engine logs I can see that the engine waits for 3 minutes without any activity and then times out.

I am not able to see whether the data was not sent or was not received. Can you please set the engine log level to debug, run the test one more time and attach the logs?

Comment 7 Liron Aravot 2014-12-01 09:26:06 UTC
Created attachment 963191 [details]
logs

Comment 8 Liron Aravot 2014-12-01 09:27:06 UTC
The attached logs are the relevant original logs (not at debug level).
lkuchlan, can you provide Piotr with the needed logs at debug level?

thanks.

Comment 9 lkuchlan 2014-12-02 08:10:10 UTC
Created attachment 963582 [details]
logs and images

Liron, please find attached the logs from reproductions using XML-RPC and JSON-RPC.
With XML-RPC it works, but with JSON-RPC, after exporting a VM based on a template, the VM consistently gets stuck with an Image Locked status

Comment 10 Piotr Kliczewski 2014-12-02 15:46:58 UTC
Thank you for providing the logs, but they use the info log level.

Can you please set the log level to debug?

I am not sure whether the data was not sent or was not received.

Comment 11 Piotr Kliczewski 2014-12-03 11:00:37 UTC
I tried to follow the steps to reproduce the issue and here are the results:

I built the latest 3.5 of the engine and vdsm, ran both on f20, and used jsonrpc.

Steps to Reproduce:
1. create a vm based on template => OK (based on CirrOS imported as template)
2. create a snapshot => OK
3. export the vm without the template => OK
4. try to import the vm => I saw the import conflict dialogue, changed the name of the vm, and got: "Cannot import and collapse VM, Template is missing in domain export"
5. export the template => When I clicked OK I saw: "Cannot export Template. The Storage Domain already contains the target disk(s)."

After performing the above operations I grepped the logs and saw that UpdateVMVDSCommand was successful on the engine side; on the vdsm side I saw:

Thread-634::DEBUG::2014-12-03 11:37:36,504::__init__::500::jsonrpc.JsonRpcServer::(_serveRequest) Return 'StoragePool.updateVMs' in bridge with True


Can you please suggest how to reproduce it?

Comment 12 Piotr Kliczewski 2014-12-03 11:23:13 UTC
Please retest with new build which should be available today.

Comment 13 lkuchlan 2014-12-03 14:36:59 UTC
Created attachment 964147 [details]
logs as debug level

Tested on RHEVM 3.5 vt12

Comment 14 Piotr Kliczewski 2014-12-03 16:22:37 UTC
Thank you for the new logs. It looks like the data was sent to the other side.
To confirm, can you please provide a wireshark cap file?

Comment 15 Piotr Kliczewski 2014-12-04 11:24:36 UTC
I performed the same steps as in comment #11, with the difference that I installed the host from the qa-3.50-latest repo on rhel66.

I had exactly the same results as in that comment.

Comment 16 Elad 2014-12-04 15:38:06 UTC
Created attachment 964727 [details]
logs and cap files

Reproduced:


2014-12-04 17:19:26,465 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.UpdateVMVDSCommand] (org.ovirt.thread.pool-7-thread-49) START, UpdateVMVDSCommand( storagePoolId = 00000002-0002-0002-0002-0000000002e1, ignoreFailoverLimit = false, storageDomainId = 05671944-6611-4331-ac4b-6330d1293132, infoDictionary.size = 1), log id: 5765f455
2014-12-04 17:20:05,068 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (DefaultQuartzScheduler_Worker-72) Setting new tasks map. The map contains now 1 tasks
2014-12-04 17:20:07,390 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-81) VM vm-7 bc3a1c58-3bd9-4f40-ae2e-648803bb6c1e moved from PoweringUp --> Up
2014-12-04 17:20:07,405 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-81) Correlation ID: 47dc1641, Job ID: f9f49b55-b659-4303-8a83-ffac7d514e16, Call Stack: null, Custom Event ID: -1, Message: VM vm-7 started on Host camel-vdsb
2014-12-04 17:22:26,833 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.UpdateVMVDSCommand] (org.ovirt.thread.pool-7-thread-49) Command UpdateVMVDSCommand( storagePoolId = 00000002-0002-0002-0002-0000000002e1, ignoreFailoverLimit = false, storageDomainId = 05671944-6611-4331-ac4b-6330d1293132, infoDictionary.size = 1) execution failed. Exception: IrsOperationFailedNoFailoverException: IRSGenericException: IRSErrorException: Message timeout which can be caused by communication issues
2014-12-04 17:22:26,833 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.UpdateVMVDSCommand] (org.ovirt.thread.pool-7-thread-49) FINISH, UpdateVMVDSCommand, log id: 5765f455
2014-12-04 17:22:26,850 INFO  [org.ovirt.engine.core.utils.transaction.TransactionSupport] (org.ovirt.thread.pool-7-thread-49) transaction rolled back
2014-12-04 17:22:26,860 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-49) Correlation ID: 5efdd286, Call Stack: null, Custom Event ID: -1, Message: Failed to export Vm vm-6 to export
2014-12-04 17:22:26,860 ERROR [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-7-thread-49) [within thread]: endAction for action type ExportVm threw an exception.
2014-12-04 17:22:26,860 ERROR [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-7-thread-49) org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.irsbroker.IrsOperationFailedNoFailoverException: IRSGenericException: IRSErrorException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022)
2014-12-04 17:22:26,860 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-7-thread-49) CommandAsyncTask::HandleEndActionResult [within thread]: endAction for action type ExportVm completed, handling the result.



Performed scenario #2 until the export phase:

 create a vm from template and run the vm -> LSM the vm -> shut down the vm -> export the vm


logs and cap files attached

Comment 17 Elad 2014-12-04 18:33:33 UTC
Created attachment 964785 [details]
screenshot

Please look at the attached screenshot.
I'm not familiar with such behaviour.

Comment 18 Piotr Kliczewski 2014-12-04 19:41:08 UTC
Which version of wireshark are you using? I am using 1.10.11 and I am not able to open the files.

Comment 19 Piotr Kliczewski 2014-12-04 19:42:07 UTC
I missed the LSM step in my testing, so I will try to reproduce it one more time.

Comment 20 Piotr Kliczewski 2014-12-05 08:17:30 UTC
I checked the log files and tcpdumps and it looks like the engine sends the message:

17:22:36.727627 IP vm-162-38.scl.lab.tlv.redhat.com.54429 > 10.35.116.2.54321: Flags [.], seq 2781:5677, ack 4415, win 205, options [nop,nop,TS val 1839070084 ecr 6681229], length 2896
17:22:36.727668 IP vm-162-38.scl.lab.tlv.redhat.com.54429 > 10.35.116.2.54321: Flags [.], seq 5677:8573, ack 4415, win 205, options [nop,nop,TS val 1839070084 ecr 6681229], length 2896
17:22:36.727676 IP vm-162-38.scl.lab.tlv.redhat.com.54429 > 10.35.116.2.54321: Flags [.], seq 8573:11469, ack 4415, win 205, options [nop,nop,TS val 1839070084 ecr 6681229], length 2896
17:22:36.727686 IP vm-162-38.scl.lab.tlv.redhat.com.54429 > 10.35.116.2.54321: Flags [.], seq 11469:14365, ack 4415, win 205, options [nop,nop,TS val 1839070084 ecr 6681229], length 2896
17:22:36.727694 IP vm-162-38.scl.lab.tlv.redhat.com.54429 > 10.35.116.2.54321: Flags [.], seq 14365:17261, ack 4415, win 205, options [nop,nop,TS val 1839070084 ecr 6681229], length 2896
17:22:36.728297 IP 10.35.116.2.54321 > vm-162-38.scl.lab.tlv.redhat.com.54429: Flags [.], ack 17261, win 249, options [nop,nop,TS val 6681405 ecr 1839070084], length 0
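The tcpdump above shows the kernel accepted and acked the bytes, but that alone doesn't prove the application handed the whole frame to the socket in one piece. One of the merged fixes is titled "send: wait on OS buff to empty", which suggests partial sends were involved: on a non-blocking socket, a single send() may accept only a prefix of the buffer. A minimal Python sketch of the retry-until-drained pattern (illustrative only, not the actual vdsm/engine code; `send_frame` is a hypothetical helper name):

```python
import select
import socket

def send_frame(sock: socket.socket, frame: bytes) -> None:
    # Keep handing bytes to the kernel until the whole frame has been
    # accepted: on a non-blocking socket, send() may take only a prefix
    # of the buffer when the OS send buffer is nearly full.
    view = memoryview(frame)
    while view:
        # Wait until the socket is writable again before retrying.
        select.select([], [sock], [])
        try:
            sent = sock.send(view)
        except BlockingIOError:
            # Buffer filled up between select() and send(); retry.
            continue
        view = view[sent:]
```

If the sender instead calls send() once and ignores the return value, the tail of a large frame (like the ~23 KB updateVMs request here) is silently dropped, which matches the symptom of a message that never arrives intact.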

Comment 21 Piotr Kliczewski 2014-12-05 08:47:30 UTC
I ran the following steps:

 create a vm from template and run the vm -> LSM the vm -> shut down the vm -> export the vm

using engine on f20 and host on f20. The test was successful.

Comment 22 Oved Ourfali 2014-12-05 08:55:44 UTC
(In reply to Piotr Kliczewski from comment #21)
> I run following steps:
> 
>  create a vm from template and run the vm -> LSM the vm -> shut down the vm
> -> export the vm
> 
> using engine on f20 and host on f20. The test was successful.

I guess you're already on it, but in case you're not, I'd try rhel as well.

Comment 23 Piotr Kliczewski 2014-12-05 12:05:17 UTC
I attempted to reproduce it on lkuchlan's environment but I was not able to perform LSM. When I did an offline move there was no issue.

I started to analyze the logs provided by Elad and noticed that every time UpdateVMVDSCommand is run, the following line appears in vdsm.log:

Thread-1909::DEBUG::2014-12-04 17:22:39,441::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 80 edom: 20 level: 2 message: metadata not found: Requested metadata element is not present

I have seen before that libvirt was able to make vdsm unresponsive.

Comment 24 Piotr Kliczewski 2014-12-05 13:55:57 UTC
I started to play with the reproducer on Elad's env. I modified the recv method in betterAsyncore.py and I am able to see the message on the vdsm side:

JsonRpc (StompReactor)::DEBUG::2014-12-05 15:41:15,104::betterAsyncore::289::root::(recv) SEND
destination:/queue/_local/vdsm/requests
content-length:23417

{"jsonrpc":"2.0","method":"StoragePool.updateVMs","params":{"sto

JsonRpc (StompReactor)::DEBUG::2014-12-05 15:41:15,107::betterAsyncore::289::root::(recv) :size=\"10

JsonRpc (StompReactor)::DEBUG::2014-12-05 15:41:15,110::betterAsyncore::289::root::(recv) None

The message appears to be incomplete and is never emitted for processing.
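The truncated SEND frame above is consistent with the other merged fix ("stomp: Make sure the \0 is a last char in the buffer"): a STOMP frame is NUL-terminated and may arrive split across several recv() calls, so the reader must buffer chunks until it actually sees the terminator. An illustrative Python sketch of such a reassembly buffer (not the real vdsm parser; it also ignores the content-length header, which real STOMP uses to allow NUL bytes inside a body):

```python
class StompFrameBuffer:
    """Accumulates raw recv() chunks and yields complete STOMP frames.

    A frame may arrive split across several recv() calls, or several
    frames may arrive in one chunk, so the parser scans for the NUL
    terminator instead of assuming each chunk is exactly one frame.
    """

    def __init__(self):
        self._buf = bytearray()

    def feed(self, chunk: bytes):
        self._buf += chunk
        frames = []
        while True:
            end = self._buf.find(b"\x00")
            if end == -1:
                break  # no complete frame yet; keep buffering
            frames.append(bytes(self._buf[:end]))
            del self._buf[:end + 1]  # drop frame and its terminator
        return frames
```

A parser that only checks whether the NUL is the last byte of the current chunk would drop frames whose terminator lands mid-chunk, or wait forever on a frame whose tail arrives in a later recv() — the behavior seen in the log above.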

Comment 25 Elad 2014-12-05 22:06:06 UTC
Created attachment 965251 [details]
cap files and logs (1)

Uploading the correct cap files, along with the logs from engine and vdsm.

2014-12-05 23:08:02,760 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.UpdateVMVDSCommand] (org.ovirt.thread.pool-7-thread-33) Command UpdateVMVDSCommand( storagePoolId = 0000
0002-0002-0002-0002-0000000002e1, ignoreFailoverLimit = false, storageDomainId = 05671944-6611-4331-ac4b-6330d1293132, infoDictionary.size = 1) execution failed. Exception: IrsO
perationFailedNoFailoverException: IRSGenericException: IRSErrorException: Message timeout which can be caused by communication issues

Comment 26 Elad 2014-12-07 08:52:50 UTC
Removing the need-info

Comment 30 Raz Tamir 2014-12-09 09:30:51 UTC
Created attachment 966159 [details]
new vdsm log

Comment 31 Raz Tamir 2014-12-09 09:32:28 UTC
Created attachment 966161 [details]
new engine log

Comment 32 Raz Tamir 2014-12-09 09:33:40 UTC
Still occurs - logs attached (new vdsm log and new engine log)

Comment 34 Raz Tamir 2014-12-09 09:42:30 UTC
Reproduced on another host 10.35.102.10 - Red Hat Enterprise Linux Server release 7.0

Comment 35 Saggi Mizrahi 2014-12-09 09:48:34 UTC
Please post cap files of both sides.
Prefilter to port 54321.

Comment 36 Piotr Kliczewski 2014-12-09 12:46:50 UTC
I tested on the suggested env; the updateVMs call failed, but the code tried to recover twice and the second attempt was successful.

Comment 37 Carlos Mestre González 2014-12-10 12:10:15 UTC
Created attachment 966759 [details]
engine log for export vm case

The run is explained in the comment; the export action starts at 2014-12-09 23:45.

Comment 38 Carlos Mestre González 2014-12-10 12:12:26 UTC
I've seen this issue in our 3.5 jobs for the read-only disks tests:

1. create a vm
2. add 6 read only disks with different combinations (raw/cow, sparse/not, virtio/virtio-scsi)
3. export the vm using rest api

UpdateVMVDSCommand fails with communication issues as stated in previous comments.

Comment 39 Piotr Kliczewski 2014-12-10 12:26:52 UTC
Carlos, it is the same issue and we already have a fix for it.

Comment 40 Carlos Mestre González 2014-12-10 12:34:44 UTC
Yes, I saw. I think it is the same issue, but I prefer to add a comment with logs just in case this bug goes back to NEW because the fix turns out to be incorrect (thus having more info).

Comment 41 Eyal Edri 2015-02-17 17:12:27 UTC
RHEV 3.5.0 was released. Closing.

