Bug 624252 - Remote driver is not flushing event stream correctly, resulting in delayed event notification to applications
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: libvirt
Version: 6.1
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Eric Blake
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Duplicates: 667309 (view as bug list)
Depends On:
Blocks: 666158 670949
 
Reported: 2010-08-15 08:10 UTC by Haim
Modified: 2014-01-13 00:46 UTC
CC List: 27 users

Fixed In Version: libvirt-0.8.7-14.el6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-05-19 13:20:08 UTC
Target Upstream Version:
Embargoed:


Attachments
logs (4.58 MB, application/x-gzip)
2010-08-15 08:10 UTC, Haim
no flags Details
vdsm and libvirt log after 'CONT' event (2.74 MB, application/x-gzip)
2010-08-17 06:36 UTC, Haim
no flags Details
log libvirt events (558 bytes, text/x-python)
2010-11-29 12:11 UTC, Dan Kenigsberg
no flags Details
required logs. (693.19 KB, application/x-gzip)
2010-12-02 12:47 UTC, Haim
no flags Details
vdsm-captured libvirt events (16.85 KB, application/gzipped-tar)
2010-12-05 16:52 UTC, Dan Kenigsberg
no flags Details
libvirt, small.py and vdsm logs (5.89 KB, application/gzipped-tar)
2010-12-28 13:54 UTC, Dan Kenigsberg
no flags Details
Ugly script reproducing the issue without vdsm (3.41 KB, text/plain)
2011-01-05 18:54 UTC, Cole Robinson
no flags Details
Updated reproducer script (5.01 KB, text/plain)
2011-01-26 23:15 UTC, Cole Robinson
no flags Details
full libvirtd log. (12.70 MB, application/x-gzip)
2011-02-23 07:48 UTC, David Naori
no flags Details
libvirt client+server logs (146.75 KB, text/plain)
2011-03-01 08:59 UTC, Dan Kenigsberg
no flags Details
libvirt client+server logs (146.75 KB, text/plain)
2011-03-09 09:19 UTC, Dan Kenigsberg
no flags Details
libvirt client+server logs (411.15 KB, text/plain)
2011-03-09 16:42 UTC, Dan Kenigsberg
no flags Details
libvirt client+server logs 10 sec delay (135.10 KB, text/plain)
2011-03-10 19:43 UTC, Dan Kenigsberg
no flags Details
libvirt client+server logs, 10 sec delay, libvirtev debug (529.55 KB, text/plain)
2011-03-13 11:16 UTC, Dan Kenigsberg
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2011:0596 0 normal SHIPPED_LIVE libvirt bug fix and enhancement update 2011-05-18 17:56:36 UTC

Description Haim 2010-08-15 08:10:34 UTC
Created attachment 438810 [details]
logs

Description of problem:

Hit a situation where a guest machine is reported as 'up' by vdsm and 'paused' by virsh.
From reviewing the logs (libvirtd.log and vdsm), it seems that libvirt produced a set of events, but in a different order: we see an EIO event, then 'pause', then 'resume', and then some 'eother' events. This state probably confuses vdsm, which results in the current state, or it misses some events.


2.6.32-59.1.el6.x86_64
libvirt-0.8.1-23.el6.x86_64
vdsm-4.9-12.2.x86_64
device-mapper-multipath-0.4.9-25.el6.x86_64
lvm2-2.02.72-4.el6.x86_64
qemu-kvm-0.12.1.2-2.109.el6.x86_64

09:47:53.056: debug : qemuMonitorJSONIOProcessLine:115 : Line [{"timestamp": {"seconds": 1281854873, "microseconds": 56767}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk1", "__com.redhat_reason": "eother", "operation": "write", "action": "stop"}}]
09:47:53.056: debug : virJSONValueFromString:932 : string={"timestamp": {"seconds": 1281854873, "microseconds": 56767}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk1", "__com.redhat_reason": "eother", "operation": "write", "action": "stop"}}
09:47:53.058: debug : remoteRelayDomainEventLifecycle:118 : Relaying domain lifecycle event 3 2
09:47:53.067: debug : qemuMonitorJSONIOProcessLine:115 : Line [{"timestamp": {"seconds": 1281854873, "microseconds": 67664}, "event": "STOP"}]
09:47:53.067: debug : virJSONValueFromString:932 : string={"timestamp": {"seconds": 1281854873, "microseconds": 67664}, "event": "STOP"}
09:47:53.068: debug : qemuMonitorJSONIOProcessLine:115 : Line [{"timestamp": {"seconds": 1281854873, "microseconds": 67838}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk1", "__com.redhat_reason": "eother", "operation": "write", "action": "stop"}}]
09:47:53.068: debug : virJSONValueFromString:932 : string={"timestamp": {"seconds": 1281854873, "microseconds": 67838}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk1", "__com.redhat_reason": "eother", "operation": "write", "action": "stop"}}
09:47:53.074: debug : qemuMonitorJSONIOProcessLine:115 : Line [{"timestamp": {"seconds": 1281854873, "microseconds": 67975}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk1", "__com.redhat_reason": "eother", "operation": "write", "action": "stop"}}]
09:47:53.074: debug : virJSONValueFromString:932 : string={"timestamp": {"seconds": 1281854873, "microseconds": 67975}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk1", "__com.redhat_reason": "eother", "operation": "write", "action": "stop"}}
09:47:53.075: debug : qemuMonitorJSONIOProcessLine:115 : Line [{"timestamp": {"seconds": 1281854873, "microseconds": 68109}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk1", "__com.redhat_reason": "eother", "operation": "write", "action": "stop"}}]
09:47:53.075: debug : virJSONValueFromString:932 : string={"timestamp": {"seconds": 1281854873, "microseconds": 68109}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk1", "__com.redhat_reason": "eother", "operation": "write", "action": "stop"}}
09:47:53.075: debug : qemuMonitorJSONIOProcessLine:115 : Line [{"timestamp": {"seconds": 1281854873, "microseconds": 68261}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk1", "__com.redhat_reason": "eother", "operation": "write", "action": "stop"}}]
09:47:53.075: debug : virJSONValueFromString:932 : string={"timestamp": {"seconds": 1281854873, "microseconds": 68261}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk1", "__com.redhat_reason": "eother", "operation": "write", "action": "stop"}}
09:47:53.075: debug : qemuMonitorJSONIOProcessLine:115 : Line [{"timestamp": {"seconds": 1281854873, "microseconds": 68396}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk1", "__com.redhat_reason": "eother", "operation": "write", "action": "stop"}}]
09:47:53.075: debug : virJSONValueFromString:932 : string={"timestamp": {"seconds": 1281854873, "microseconds": 68396}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk1", "__com.redhat_reason": "eother", "operation": "write", "action": "stop"}}
09:47:53.076: debug : qemuMonitorJSONIOProcessLine:115 : Line [{"timestamp": {"seconds": 1281854873, "microseconds": 68532}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk1", "__com.redhat_reason": "eother", "operation": "write", "action": "stop"}}]
09:47:53.076: debug : virJSONValueFromString:932 : string={"timestamp": {"seconds": 1281854873, "microseconds": 68532}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk1", "__com.redhat_reason": "eother", "operation": "write", "action": "stop"}}
09:47:54.131: debug : qemuMonitorJSONIOProcessLine:115 : Line [{"timestamp": {"seconds": 1281854874, "microseconds": 131650}, "event": "RESUME"}]
09:47:54.131: debug : virJSONValueFromString:932 : string={"timestamp": {"seconds": 1281854874, "microseconds": 131650}, "event": "RESUME"}
09:47:54.132: debug : remoteRelayDomainEventLifecycle:118 : Relaying domain lifecycle event 4 0
09:47:54.132: debug : qemuMonitorJSONIOProcessLine:115 : Line [{"timestamp": {"seconds": 1281854874, "microseconds": 132856}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk1", "__com.redhat_reason": "eother", "operation": "write", "action": "stop"}}]
09:47:54.132: debug : virJSONValueFromString:932 : string={"timestamp": {"seconds": 1281854874, "microseconds": 132856}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk1", "__com.redhat_reason": "eother", "operation": "write", "action": "stop"}}
09:47:54.137: debug : remoteRelayDomainEventLifecycle:118 : Relaying domain lifecycle event 3 2
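
For readers decoding the numeric pairs in the "Relaying domain lifecycle event" lines above, the libvirt Python binding exposes the matching constants. A quick check (a sketch; the constant names come from the public libvirt API, not from these logs):

import libvirt

# "Relaying domain lifecycle event 3 2" -> SUSPENDED with detail IOERROR
print(libvirt.VIR_DOMAIN_EVENT_SUSPENDED, libvirt.VIR_DOMAIN_EVENT_SUSPENDED_IOERROR)   # 3 2
# "Relaying domain lifecycle event 4 0" -> RESUMED with detail UNPAUSED
print(libvirt.VIR_DOMAIN_EVENT_RESUMED, libvirt.VIR_DOMAIN_EVENT_RESUMED_UNPAUSED)      # 4 0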

Comment 2 RHEL Program Management 2010-08-15 08:38:25 UTC
This issue has been proposed when we are only considering blocker
issues in the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **

Comment 3 Daniel Berrangé 2010-08-16 11:01:01 UTC
When receiving a BLOCK_IO_ERROR from QEMU, libvirt will produce either one or two events. It will always produce an IO error event first. Then, if the current VM state is 'running' and the BLOCK_IO_ERROR action field is 'stop', it may also produce a lifecycle event of 'SUSPENDED' with 'reason == IOERROR'.
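
For context, a client such as vdsm receives those one or two events through registered callbacks. A minimal sketch of what that registration looks like with the Python binding (assumptions: a libvirt-python new enough to ship the default event loop; clients of this era used the pure-Python loop from examples/domain-events/events-python instead):

import libvirt

def lifecycle_cb(conn, dom, event, detail, opaque):
    # e.g. event=VIR_DOMAIN_EVENT_SUSPENDED, detail=VIR_DOMAIN_EVENT_SUSPENDED_IOERROR
    print("lifecycle: %s event=%d detail=%d" % (dom.name(), event, detail))

def io_error_cb(conn, dom, srcpath, devalias, action, reason, opaque):
    # always delivered first; 'reason' carries the string seen in the logs, e.g. "eother"
    print("io error: %s dev=%s action=%d reason=%s" % (dom.name(), devalias, action, reason))

libvirt.virEventRegisterDefaultImpl()
conn = libvirt.openReadOnly("qemu:///system")
conn.domainEventRegisterAny(None, libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE, lifecycle_cb, None)
conn.domainEventRegisterAny(None, libvirt.VIR_DOMAIN_EVENT_ID_IO_ERROR_REASON, io_error_cb, None)

while True:
    libvirt.virEventRunDefaultImpl()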

Comment 5 Haim 2010-08-17 06:36:48 UTC
Created attachment 439059 [details]
vdsm and libvirt log after 'CONT' event

Danken - attached are the logs for both libvirt and vdsm; they contain the relevant info about the STOP event that doesn't reach vdsm.

Comment 6 Daniel Berrangé 2010-08-17 10:03:00 UTC
I don't see any libvirt bug based on these logs

$ grep Relay Downloads/logs 
17:35:16.407: debug : remoteRelayDomainEventLifecycle:118 : Relaying domain lifecycle event 4 0
17:35:16.415: debug : remoteRelayDomainEventIOErrorReason:274 : Relaying domain io error boom-pam 39 /rhev/data-center/6acd4aff-334a-44e1-8370-048f1ba9962b/6c4717af-38c1-47ea-846d-e8ecea1cd633/images/efc58c43-2ba5-4ced-8791-77858226035f/f3866833-f0e0-4336-93db-0385c9e048d6 virtio-disk1 1 eother
17:35:16.415: debug : remoteRelayDomainEventLifecycle:118 : Relaying domain lifecycle event 3 2
17:35:16.418: debug : remoteRelayDomainEventIOErrorReason:274 : Relaying domain io error boom-pam 39 /rhev/data-center/6acd4aff-334a-44e1-8370-048f1ba9962b/6c4717af-38c1-47ea-846d-e8ecea1cd633/images/efc58c43-2ba5-4ced-8791-77858226035f/f3866833-f0e0-4336-93db-0385c9e048d6 virtio-disk1 1 eother
17:35:16.418: debug : remoteRelayDomainEventIOErrorReason:274 : Relaying domain io error boom-pam 39 /rhev/data-center/6acd4aff-334a-44e1-8370-048f1ba9962b/6c4717af-38c1-47ea-846d-e8ecea1cd633/images/efc58c43-2ba5-4ced-8791-77858226035f/f3866833-f0e0-4336-93db-0385c9e048d6 virtio-disk1 1 eother
17:35:16.418: debug : remoteRelayDomainEventIOErrorReason:274 : Relaying domain io error boom-pam 39 /rhev/data-center/6acd4aff-334a-44e1-8370-048f1ba9962b/6c4717af-38c1-47ea-846d-e8ecea1cd633/images/efc58c43-2ba5-4ced-8791-77858226035f/f3866833-f0e0-4336-93db-0385c9e048d6 virtio-disk1 1 eother

This shows that

1. VDSM resumed the guest
2. An IO error occurred
3. The guest was paused by libvirt due to IO error in step 2
4. Many more IO errors were received from further outstanding IO requests issued before the guest was paused on the first error.

Comment 7 Denise Dumas 2010-08-20 14:27:31 UTC
Based on comment 6, should this be closed notabug?

Comment 8 Haim 2010-08-22 05:55:26 UTC
No, it is a bug; however, its origin is not yet known, as libvirt produces an event (pause) but vdsm fails to receive it.

Danken - should we change the component?

Comment 11 Haim 2010-09-02 14:03:29 UTC
Just hit it again, the other way around, in a migration scenario:

1) migrate a vm from source to destination
2) the vm is migrated to the destination machine; however, it enters the paused state in vdsm!
3) it seems that libvirt sends the RESUME event, though it does not reach vdsm for some reason.

Comment 15 Barak 2010-11-28 16:49:57 UTC
Moved to Conditional nak on capacity.

Comment 16 Dan Kenigsberg 2010-11-29 12:11:21 UTC
Created attachment 463469 [details]
log libvirt events

Haim, I would like to see if vdsm loses the event or if it is lost a bit below it. Could you try to reproduce this issue with the attached script running:

LIBVIRT_DEBUG=1 python /usr/share/vdsm/small.py 2> /tmp/events.log &


This should give us some information from the client side of libvirt; if the lost event shows up there, then it's a pure vdsm issue.

Comment 17 Haim 2010-12-02 12:47:37 UTC
Created attachment 464240 [details]
required logs.

Dan, 

attached please find 3 logs: 

1) vdsm log 
2) libvirtd log 
3) script log (libvirt client). 

The problem seems to reproduce on concurrent migration.

Hope you have all the information; I have set the testblocker flag back on.

Comment 18 Dan Kenigsberg 2010-12-05 16:52:51 UTC
Created attachment 464862 [details]
vdsm-captured libvirt events

Dan, I suspect there's a problem in libvirt's examples/domain-events/events-python/event-test.py or in the way I use it in vdsm. I started up 20 VMs, and later pkilled them all from the command line.

libvirt sent 20 "Stopped" events, and all of them were caught by my small.py libvirt client. However, vdsm caught only 10[*] of the events, which is detrimental to vdsm's notion of reality. I attach vdsm log relating to these events. Could you tell anything from it?

[*] It's not always half of the events that are lost.

Comment 21 Cole Robinson 2010-12-21 19:50:21 UTC
If the simple.py script could be updated to activate vdsm debugging and print a timestamp when an event is fired, we could get debug output from libvirtd, vdsm, and simple.py and try to match timestamps to see the differences.

Please also keep the debug output in separate files, and preferably use the start/stop reproducer danken mentioned in c18. Thanks

Comment 22 Dan Kenigsberg 2010-12-28 13:54:57 UTC
Created attachment 470962 [details]
libvirt, small.py and vdsm logs

Cole, I attach the logs that you requested (I hope), though I do not know what you expect to see there.

I ran 10 domains, and later pkilled them all.
I see 20 events relayed from libvirtd (probably 10 per client). small.py got them all, but vdsm's callback _onLibvirtLifecycleEvent() is called only once.

Comment 23 Dan Kenigsberg 2010-12-28 14:03:09 UTC
Please take a closer look at libvirtev.py. I wonder if a multithreaded, forking, complex client may confuse it.

Comment 24 Cole Robinson 2011-01-05 18:54:47 UTC
Created attachment 471922 [details]
Ugly script reproducing the issue without vdsm

The attached script reproduces this issue against libvirt without vdsm. To run:

cp libvirt.git/examples/domain-events/events-python/event-test.py libvirtev.py
sudo python simple-lostevents.py

The script launches a separate thread which performs domain lookups and stats polling in a loop. That alone isn't enough to cause an issue, though; to trigger the problem I needed to stick a sleep in the event callback.

I still need to dig a bit more to understand what's happening here; the events code is all new to me.
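
In rough outline, the race the script exercises looks like the sketch below: one thread keeps issuing ordinary libvirt API calls while the event callback is deliberately slow. This is illustrative only (it uses the modern default event loop, and the attached script additionally creates and destroys the test domains itself):

import threading, time
import libvirt

libvirt.virEventRegisterDefaultImpl()
conn = libvirt.open("qemu:///system")

def lifecycle_cb(conn, dom, event, detail, opaque):
    time.sleep(1)   # slow callback: events queued meanwhile were never re-dispatched by the old remote driver
    print("%.3f %s %d %d" % (time.time(), dom.name(), event, detail))

conn.domainEventRegisterAny(None, libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE, lifecycle_cb, None)

def poller():
    while True:     # background API traffic, like the lookup/stats loop in the attached script
        try:
            for domid in conn.listDomainsID():
                conn.lookupByID(domid).info()
        except libvirt.libvirtError:
            pass    # a domain may vanish between the list and the lookup
        time.sleep(0.1)

threading.Thread(target=poller, daemon=True).start()

while True:
    libvirt.virEventRunDefaultImpl()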

Comment 25 Cole Robinson 2011-01-05 21:20:55 UTC
Okay, I think I understand what's happening.

After the remote driver runs an event callback, it unconditionally disables the loop timer, thinking it just flushed every queued event. This doesn't work correctly though if an event is queued while a callback is running.

The events actually weren't being lost; it's just that the event loop didn't think there was anything that needed to be dispatched. So all those 'lost events' should actually get re-triggered if you manually kick the loop by generating a new event (like creating a new guest).

The solution is to disable the dispatch timer _before_ we invoke any event callbacks. Events queued while a callback is running will properly reenable the timer.

This patch fixes things for me, and matches how it's done in the qemu driver at least:

diff --git a/src/remote/remote_driver.c b/src/remote/remote_driver.c
index ee2de4a..ea119c6 100644
--- a/src/remote/remote_driver.c
+++ b/src/remote/remote_driver.c
@@ -10557,9 +10557,9 @@ remoteDomainEventQueueFlush(int timer ATTRIBUTE_UNUSED, void 
     priv->domainEvents->count = 0;
     priv->domainEvents->events = NULL;
 
+    virEventUpdateTimeout(priv->eventFlushTimer, -1);
     virDomainEventQueueDispatch(&tempQueue, priv->callbackList,
                                 remoteDomainEventDispatchFunc, priv);
-    virEventUpdateTimeout(priv->eventFlushTimer, -1);
 
     /* Purge any deleted callbacks */
     virDomainEventCallbackListPurgeMarked(priv->callbackList);

Comment 29 Vivian Bian 2011-01-21 12:04:51 UTC
Tried the following; please have a look at whether it is the correct way to verify the bug:

1. tried to reproduce the bug :
   libvirt-0.8.7-1.el6.x86_64.rpm

   a. create two guests with the same disk to generate the I/O error 
   b. wait for the suspend status for the guest 

   But I seem to have been out of luck: the guests kept running all the time without turning paused.

2. tried with the new version of libvirt libvirt-0.8.7-3.el6

   a. create two guests with the same disk to generate the I/O error 
   b. wait for the suspend status for the guest 

   No guest goes paused . 
 
   c. manually performed virsh suspend guest; the event-test.py script caught the suspend event at once, and virsh resume guest brought the guest back to running successfully.

Comment 30 Dave Allan 2011-01-21 18:17:16 UTC
Vivian, please see comment #24 for a reproducer script.  Also, Daniel Berrange has a lot on his plate all the time, so please don't request needinfo from him unless you have a specific reason to think that he, and no one else, has the information you need.  If you don't know who has the information you need, ask me.

Comment 31 Vivian Bian 2011-01-24 07:16:44 UTC
Tried the following:

1. tried to reproduce the bug :
   libvirt-0.8.7-1.el6.x86_64.rpm

   executed the reproducer script from comment #24, and got the following result :
# python reproducer 
Starting VMs
1295834870.67 test1 running (0, 0)
1295834871.27 test2 running (0, 0)
1295834871.85 test3 running (0, 0)
1295834872.43 test4 running (0, 0)
1295834873.02 test5 running (0, 0)
1295834873.60 test6 running (0, 0)
1295834874.17 test7 running (0, 0)
1295834874.77 test8 running (0, 0)
1295834875.36 test9 running (0, 0)
1295834876.18 test10 running (0, 0)
1295834876.81 test11 running (0, 0)
1295834877.46 test12 running (0, 0)
1295834878.06 test13 running (0, 0)
1295834878.46 test14 running (0, 0)
1295834878.98 test15 running (0, 0)
1295834879.53 test16 running (0, 0)
1295834880.00 test17 running (0, 0)
1295834880.44 test18 running (0, 0)
1295834880.88 test19 running (0, 0)
1295834881.36 test20 running (0, 0)
Destroying VMs
1295834887.69 test11 shutoff (0, 0) 1
1295834888.69 test1 shutoff (0, 0) 2
1295834889.69 test2 shutoff (0, 0) 3
1295834890.69 test3 shutoff (0, 0) 4
1295834891.69 test4 shutoff (0, 0) 5
1295834892.69 test5 shutoff (0, 0) 6
1295834893.69 test6 shutoff (0, 0) 7
1295834894.69 test7 shutoff (5, 0) 8
1295834895.69 test8 shutoff (0, 0) 9
1295834896.70 test9 shutoff (0, 0) 10
1295834897.70 test10 shutoff (0, 0) 11
1295834898.70 test12 shutoff (0, 0) 12
1295834899.70 test13 shutoff (0, 0) 13
1295834900.70 test14 shutoff (5, 0) 14
1295834901.70 test15 shutoff (0, 0) 15
1295834902.70 test16 shutoff (0, 0) 16
1295834903.70 test18 shutoff (0, 0) 17
1295834904.70 test19 shutoff (5, 0) 18

Not all of the events were caught, and while destroying the VMs, the events were caught slowly (about 3 seconds per event).


2. tried with the new version of libvirt libvirt-0.8.7-3.el6
# python reproducer
Starting VMs
1295834870.67 test1 running (0, 0)
1295834871.27 test2 running (0, 0)
1295834871.85 test3 running (0, 0)
1295834872.43 test4 running (0, 0)
1295834873.02 test5 running (0, 0)
1295834873.60 test6 running (0, 0)
1295834874.17 test7 running (0, 0)
1295834874.77 test8 running (0, 0)
1295834875.36 test9 running (0, 0)
1295834876.18 test10 running (0, 0)
1295834876.81 test11 running (0, 0)
1295834877.46 test12 running (0, 0)
1295834878.06 test13 running (0, 0)
1295834878.46 test14 running (0, 0)
1295834878.98 test15 running (0, 0)
1295834879.53 test16 running (0, 0)
1295834880.00 test17 running (0, 0)
1295834880.44 test18 running (0, 0)
1295834880.88 test19 running (0, 0)
1295834881.36 test20 running (0, 0)
Destroying VMs
1295834887.69 test11 shutoff (0, 0) 1
1295834888.69 test1 shutoff (0, 0) 2
1295834889.69 test2 shutoff (0, 0) 3
1295834890.69 test3 shutoff (0, 0) 4
1295834891.69 test4 shutoff (0, 0) 5
1295834892.69 test5 shutoff (0, 0) 6
1295834893.69 test6 shutoff (0, 0) 7
1295834894.69 test7 shutoff (5, 0) 8
1295834895.69 test8 shutoff (0, 0) 9
1295834896.70 test9 shutoff (0, 0) 10
1295834897.70 test10 shutoff (0, 0) 11
1295834898.70 test12 shutoff (0, 0) 12
1295834899.70 test13 shutoff (0, 0) 13
1295834900.70 test14 shutoff (5, 0) 14
1295834901.70 test15 shutoff (0, 0) 15
1295834902.70 test16 shutoff (0, 0) 16
1295834903.70 test18 shutoff (0, 0) 17
1295834904.70 test19 shutoff (5, 0) 18
1295834905.70 test20 shutoff (0, 0) 19
1295834906.71 test17 shutoff (5, 0) 20


All of the events could be caught immediately, so the bug status can be set to VERIFIED.

Comment 32 David Naori 2011-01-24 18:56:24 UTC
Verified on libvirt 0.8.7-3 - seems like the issue has been solved.

Comment 33 Jiri Denemark 2011-01-24 21:06:05 UTC
*** Bug 667309 has been marked as a duplicate of this bug. ***

Comment 34 Dan Kenigsberg 2011-01-26 09:36:21 UTC
I'm afraid we still see this using vdsm.

We have started ~20 VMs and pkill'ed them all. libvirt immediately shows


10:46:23.264: 5882: debug : remoteRelayDomainEventLifecycle:136 : Relaying domain lifecycle event 5 0
10:46:23.264: 5882: debug : virDomainFree:2294 : dom=0x7f911c0540e0, (VM: name=ISO-attachcd4, uuid=e3a7dea7-6e19-4dfa-8d29-07acb3009a33), 
10:46:23.265: 5882: debug : remoteRelayDomainEventLifecycle:136 : Relaying domain lifecycle event 5 5
10:46:23.265: 5882: debug : virDomainFree:2294 : dom=0x7f911c0540e0, (VM: name=RHEL6-POOL-12, uuid=5782d90d-2be2-4bcc-91fa-8fa48e7bfbd3), 
10:46:23.265: 5882: debug : remoteRelayDomainEventLifecycle:136 : Relaying domain lifecycle event 5 5
10:46:23.265: 5882: debug : virDomainFree:2294 : dom=0x7f911c0540e0, (VM: name=RHEL6-POOL-28, uuid=ad687893-3013-4184-9ef5-b18236211991), 
10:46:23.265: 5882: debug : remoteRelayDomainEventLifecycle:136 : Relaying domain lifecycle event 5 0
10:46:23.265: 5882: debug : virDomainFree:2294 : dom=0x7f911c0540e0, (VM: name=RHEL6-POOL-33, uuid=e2a477c2-3a34-4ead-ac67-b795535d70d6), 
10:46:23.265: 5882: debug : remoteRelayDomainEventLifecycle:136 : Relaying domain lifecycle event 5 0
10:46:23.265: 5882: debug : virDomainFree:2294 : dom=0x7f911c0540e0, (VM: name=RHEL6-POOL-22, uuid=0fa8e7b4-583a-44c1-aa06-e20676d58519), 
10:46:23.265: 5882: debug : remoteRelayDomainEventLifecycle:136 : Relaying domain lifecycle event 5 0
10:46:23.265: 5882: debug : virDomainFree:2294 : dom=0x7f911c0540e0, (VM: name=RHEL6-POOL-03, uuid=83999f50-3be9-45f0-9d8b-eb33141c2d48), 
10:46:23.265: 5882: debug : remoteRelayDomainEventLifecycle:136 : Relaying domain lifecycle event 5 0
10:46:23.265: 5882: debug : virDomainFree:2294 : dom=0x7f911c0540e0, (VM: name=RHEL6-POOL-17, uuid=638de5c1-f3c8-4c29-9c38-f16adf6ce266), 
10:46:23.265: 5882: debug : remoteRelayDomainEventLifecycle:136 : Relaying domain lifecycle event 5 0
10:46:23.265: 5882: debug : virDomainFree:2294 : dom=0x7f911c0540e0, (VM: name=RHEL6-POOL-27, uuid=54925fc7-d040-46a5-a479-f8f1c751473a), 
10:46:23.265: 5882: debug : remoteRelayDomainEventLifecycle:136 : Relaying domain lifecycle event 5 5
10:46:23.266: 5882: debug : virDomainFree:2294 : dom=0x7f911c0540e0, (VM: name=RHEL6-POOL-15, uuid=df734d69-1135-4c87-861f-a3394bf54894), 
10:46:23.266: 5882: debug : remoteRelayDomainEventLifecycle:136 : Relaying domain lifecycle event 5 0
10:46:23.266: 5882: debug : virDomainFree:2294 : dom=0x7f911c0540e0, (VM: name=RHEL6-POOL-11, uuid=b81f23bd-574c-4b35-9c00-a95d292f980d), 
10:46:23.266: 5882: debug : remoteRelayDomainEventLifecycle:136 : Relaying domain lifecycle event 5 0
10:46:23.266: 5882: debug : virDomainFree:2294 : dom=0x7f911c0540e0, (VM: name=RHEL6-POOL-16, uuid=e4ca06e8-0ca5-43f1-9fb4-ad415f429d60), 
10:46:23.266: 5882: debug : remoteRelayDomainEventLifecycle:136 : Relaying domain lifecycle event 5 0
10:46:23.266: 5882: debug : virDomainFree:2294 : dom=0x7f911c0540e0, (VM: name=RHEL6-POOL-20, uuid=845cd71c-37a1-4ae1-baf7-7fbab7978045), 
10:46:23.266: 5882: debug : remoteRelayDomainEventLifecycle:136 : Relaying domain lifecycle event 5 0
10:46:23.266: 5882: debug : virDomainFree:2294 : dom=0x7f911c0540e0, (VM: name=RHEL6-POOL-23, uuid=d59ddda2-8f11-4b1d-b756-5e26c37e4d22), 
10:46:23.266: 5882: debug : remoteRelayDomainEventLifecycle:136 : Relaying domain lifecycle event 5 0
10:46:23.266: 5882: debug : virDomainFree:2294 : dom=0x7f911c0540e0, (VM: name=RHEL6-POOL-02, uuid=627098d9-8fc6-43eb-ad71-acf66b454c83), 
10:46:23.266: 5882: debug : remoteRelayDomainEventLifecycle:136 : Relaying domain lifecycle event 5 0
10:46:23.266: 5882: debug : virDomainFree:2294 : dom=0x7f911c0540e0, (VM: name=RHEL6-POOL-34, uuid=89f3bef3-acfb-4fbe-987e-701b2089c0af), 


However, these events do not reach vdsm's registered callback. Only much later, when vdsm issues libvirt.getCapabilities(), it is barraged by the lagging events:


libvirtEventLoop::DEBUG::2011-01-26 11:22:33,347::libvirtvm::1021::vds.vmlog.ad687893-3013-4184-9ef5-b18236211991::(_onLibvirtLifecycleEvent) event Stopped detail 5 opaque None
libvirtEventLoop::DEBUG::2011-01-26 11:22:33,583::libvirtvm::1021::vds.vmlog.e2a477c2-3a34-4ead-ac67-b795535d70d6::(_onLibvirtLifecycleEvent) event Stopped detail 0 opaque None
libvirtEventLoop::DEBUG::2011-01-26 11:22:33,826::libvirtvm::1021::vds.vmlog.0fa8e7b4-583a-44c1-aa06-e20676d58519::(_onLibvirtLifecycleEvent) event Stopped detail 0 opaque None
libvirtEventLoop::DEBUG::2011-01-26 11:22:34,063::libvirtvm::1021::vds.vmlog.83999f50-3be9-45f0-9d8b-eb33141c2d48::(_onLibvirtLifecycleEvent) event Stopped detail 0 opaque None
libvirtEventLoop::DEBUG::2011-01-26 11:22:34,307::libvirtvm::1021::vds.vmlog.638de5c1-f3c8-4c29-9c38-f16adf6ce266::(_onLibvirtLifecycleEvent) event Stopped detail 0 opaque None
libvirtEventLoop::DEBUG::2011-01-26 11:22:34,542::libvirtvm::1021::vds.vmlog.54925fc7-d040-46a5-a479-f8f1c751473a::(_onLibvirtLifecycleEvent) event Stopped detail 0 opaque None
libvirtEventLoop::DEBUG::2011-01-26 11:22:34,778::libvirtvm::1021::vds.vmlog.df734d69-1135-4c87-861f-a3394bf54894::(_onLibvirtLifecycleEvent) event Stopped detail 5 opaque None
libvirtEventLoop::DEBUG::2011-01-26 11:22:35,013::libvirtvm::1021::vds.vmlog.b81f23bd-574c-4b35-9c00-a95d292f980d::(_onLibvirtLifecycleEvent) event Stopped detail 0 opaque None
libvirtEventLoop::DEBUG::2011-01-26 11:22:35,260::libvirtvm::1021::vds.vmlog.e4ca06e8-0ca5-43f1-9fb4-ad415f429d60::(_onLibvirtLifecycleEvent) event Stopped detail 0 opaque None
libvirtEventLoop::DEBUG::2011-01-26 11:22:35,617::libvirtvm::1021::vds.vmlog.845cd71c-37a1-4ae1-baf7-7fbab7978045::(_onLibvirtLifecycleEvent) event Stopped detail 0 opaque None
libvirtEventLoop::DEBUG::2011-01-26 11:22:35,855::libvirtvm::1021::vds.vmlog.d59ddda2-8f11-4b1d-b756-5e26c37e4d22::(_onLibvirtLifecycleEvent) event Stopped detail 0 opaque None
libvirtEventLoop::DEBUG::2011-01-26 11:22:36,174::libvirtvm::1021::vds.vmlog.627098d9-8fc6-43eb-ad71-acf66b454c83::(_onLibvirtLifecycleEvent) event Stopped detail 0 opaque None
libvirtEventLoop::DEBUG::2011-01-26 11:22:36,406::libvirtvm::1021::vds.vmlog.89f3bef3-acfb-4fbe-987e-701b2089c0af::(_onLibvirtLifecycleEvent) event Stopped detail 0 opaque None

Comment 35 Cole Robinson 2011-01-26 23:11:06 UTC
danken, is it specifically getCapabilities, or any libvirt API call?

Now that I'm rerunning my script from comment #24 against current upstream, I'm still hitting issues. In fact, I'm getting lock-ups in both the client and occasionally the daemon, which is preventing me from making any progress debugging this. I still need to try against the RHEL6 tree.

Comment 36 Cole Robinson 2011-01-26 23:15:13 UTC
Created attachment 475494 [details]
Updated reproducer script

Updated script from comment #24. To use:

edit diskpath in the script to point to a real disk image (that you don't care about)
cp examples/domain-events/events-python/event-test.py libvirtev.py
sudo python new-lostevents.py

This regularly seems to lock up the client. By the time we try to call getCapabilities at the end of the script, everything is stuck. Can anyone else reproduce?

(Also, I'm out most of tomorrow till Monday, at which point I still have a lot of virt-manager/virtinst stuff on my plate, so if anyone wants to try to reproduce this and possibly assist with debugging, I'd appreciate it.)

Comment 37 Daniel Berrangé 2011-01-27 12:48:45 UTC
> this regularly seems to lock the client. by the time we try to call
> getCapabilities at the end of the script, everything is stuck. can anyone else
> reproduce?

This is caused by a bug in the python event loop implementation

The scenario is quite simple:

 - Thread A is running the event loop
 - Thread B is making random libvirt API calls all the time

When we receive the events from libvirtd about the VMs shutting down, they are immediately queued, to be dispatched from a timeout handler. So Thread A runs the timeout handler and starts dispatching the 20 queued events. While this is going on, another libvirt API call arrives in thread B. In the remote driver 'call' method we do

    if (priv->watch >= 0)
        virEventUpdateHandle(priv->watch, 0);

    rv = remoteIOEventLoop(conn, priv, flags, thiscall);

    if (priv->watch >= 0)
        virEventUpdateHandle(priv->watch, VIR_EVENT_HANDLE_READABLE);


The UpdateHandle calls trigger the 'interrupt' method in the python event loop impl, which causes a 'write' to the self-pipe to make thread A wake up. At this point, though, thread A is already awake, dispatching those 20 queued events.

Since you have a 'sleep(1)' in the event handler, dispatching those 20 events takes 20 seconds. During those 20 seconds, Thread A never reads from the self-pipe. This pipe has a limited buffer size, so with the huge number of libvirt API calls going on in thread B, the pipe buffer quickly becomes full, and further writes block waiting for some data to be read. Unfortunately, when it blocks, Thread B is holding the lock on the remote driver. When Thread A tries to dispatch the next event, it will try to grab the lock on the remote driver, but it can't get it. Thus Thread B is blocked on I/O on the self-pipe, and Thread A will never get around to draining the self-pipe because it's waiting on a lock.

The flaw is in the event loop impl: the virEventUpdateHandle() method must never be allowed to block in this manner.
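
Making the interrupt non-blocking is one way to satisfy that requirement in a pure-Python loop of this kind. A minimal sketch of the interrupt/drain pattern (an illustration only, in Python 3 syntax; the actual fix is the libvir-list patch referenced in comment 38):

import os

class SelfPipe:
    def __init__(self):
        self.rfd, self.wfd = os.pipe()
        os.set_blocking(self.rfd, False)
        os.set_blocking(self.wfd, False)   # interrupt() must never block

    def interrupt(self):
        # Called from API-calling threads (via virEventUpdateHandle). With a
        # blocking pipe this write stalls once the pipe buffer fills, while
        # the caller still holds the remote driver lock -- the deadlock above.
        try:
            os.write(self.wfd, b"x")
        except BlockingIOError:
            pass   # buffer already full: the loop thread will wake up anyway

    def drain(self):
        # Called by the event-loop thread once poll() reports rfd readable.
        try:
            while os.read(self.rfd, 4096):
                pass
        except BlockingIOError:
            pass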

Comment 38 Daniel Berrangé 2011-01-27 13:06:06 UTC
Apply the following patch to 'libvirtev' and that will solve the deadlocks

http://www.redhat.com/archives/libvir-list/2011-January/msg01131.html

With that applied I don't see any problems in event dispatch with Cole's demo program. All 20 events are received by remote driver and dispatched to the demo program, even with huge amounts of stuff going on in the background. I can only see the problem described by DanK in comment #34, if I remove the patch we've already applied to fix this.

Comment 39 Dave Allan 2011-02-14 14:38:13 UTC
Dan K, can you modify your code to include logic similar to that in comment 38 and see if it fixes the problem?

Comment 40 Dan Kenigsberg 2011-02-14 20:45:41 UTC
Dave, in bug 673111 I've applied dpb's patch. As you can see, vdsm is patched, verified, and happy with it. BTW, I would not mind if libvirtev.py is incorporated into the standard python binding module (in one way or another). But that's upstream-talk.

Comment 41 Dave Allan 2011-02-15 04:58:23 UTC
So where does that leave this BZ?  Is there anything further for QE to do to verify it?

Comment 42 Dan Kenigsberg 2011-02-22 15:19:42 UTC
Unfortunately, this bug is still with us:

libvirtd.log:
16:56:44.956: 20810: debug : remoteRelayDomainEventLifecycle:136 : Relaying domain lifecycle event 5 1
16:56:49.456: 20815: debug : virDomainCreateXML:2013 : conn=0x7ffb7c000a30, xmlDesc=<?xml version="1.0" ?>


vdsm.log
Thread-1252::DEBUG::2011-02-22 16:56:49,452::libvirtvm::753::vm.Vm::(_run) vmId=`9ffe28b6-6134-4b1e-8804-1185f49c436f`::<?xml version="1.0" ?> <domain type="kvm">
libvirtEventLoop::DEBUG::2011-02-22 16:56:49,453::libvirtvm::1037::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`9ffe28b6-6134-4b1e-8804-1185f49c436f`::event Stopped detail 1 opaque None
libvirtEventLoop::DEBUG::2011-02-22 16:56:49,888::libvirtvm::1037::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`9ffe28b6-6134-4b1e-8804-1185f49c436f`::event Started detail 0 opaque None

Note that the client received VIR_DOMAIN_EVENT_STOPPED 5 seconds after it was supposedly relayed, triggered by the following virDomainCreateXML().

Please contact dnaori who can reproduce this.

Comment 44 David Naori 2011-02-23 07:48:21 UTC
Created attachment 480368 [details]
full libvirtd log.

Comment 45 Daniel Berrangé 2011-02-23 15:40:32 UTC
Unfortunately that log file doesn't correspond to the time period of the data DanK pasted above. Also, we really need the logging from the libvirt client + libvirtd at the same time, since I believe this is a client-side problem rather than a libvirtd one.

Can you change your libvirtd.conf logging to be

  log_filters="1:qemu 3:json 1:util 1:security"

so that we get rid of the huge & useless virJSON log data.  And make sure that VDSM is run with these env vars set to enable libvirt client debugging: 

  LIBVIRT_LOG_FILTERS="1:qemu 3:json 1:util 1:security 3:json"
  LIBVIRT_LOG_OUTPUTS="1:file:/var/log/libvirt/vdsm.log"


Finally, install the following scratch build & restart libvirtd & vdsm. This has a patch to enable some extra verbose debug logs for events

  https://brewweb.devel.redhat.com/taskinfo?taskID=3135217

With all that done, try to reproduce the problem and then provide me the libvirt logs for vdsm + libvirtd.

Comment 47 Dan Kenigsberg 2011-03-01 08:59:05 UTC
Created attachment 481562 [details]
libvirt client+server logs

(In reply to comment #46)
> Could this be a case of testing a stale binary?
No. The system was rebooted yesterday to test a new kernel, after libvirt was last updated. Anyway, the problem reproduces when I run ruth...

I've edited /etc/init.d/vdsmd to set
  LIBVIRT_LOG_FILTERS="1:qemu 3:json 1:util 1:security"
  LIBVIRT_LOG_OUTPUTS="1:file:/var/log/vdsm/libvirt.log"
  LIBVIRT_DEBUG=1
(note changed location due to permissions, and the added LIBVIRT_DEBUG=1)

The aberrant event is logged by vdsm at
libvirtEventLoop::DEBUG::2011-03-01 10:46:10,820::libvirtvm::1040::vm.Vm::(_onLi
bvirtLifecycleEvent) vmId=`44896858-2177-41c6-914b-ed790b55ee89`::event Stopped 
detail 1 opaque None 

P.S., I would like vdsm to set libvirt's log_filters so that the logs are most helpful for you guys. Should vdsm always use the string requested in comment 45 for server- and client-side logging?

Comment 48 Daniel Berrangé 2011-03-01 10:48:11 UTC
> Should vdsm always use the string requested in comment 45 for server- and client-side logging?

Yes, the 'json' debug output is proving far too verbose & not very useful. So changing all your servers to hide the JSON logs as per comment #45 would be ideal.

Comment 49 Daniel Berrangé 2011-03-07 16:53:50 UTC
> Created attachment 481562 [details]
> libvirt client+server logs

You don't appear to have installed the custom libvirt build I requested in comment #45, so these logs don't have the extra debug info I require. Please install the scratch build & collect the logs for client/server with that present - the log file should confirm which build you have installed & running

eg:

10:45:44.323: 7056: info : libvirt version: 0.8.7, package: 7.el6 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2011-02-17-11:07:22, hs20-bc2-5.build.redhat.com)


that should say  '7.el6.2dan' if you have the custom build I requested installed.

Comment 50 Dan Kenigsberg 2011-03-08 16:25:25 UTC
Pardon, but the scratch build https://brewweb.devel.redhat.com/taskinfo?taskID=3135217 has been garbage-collected. May we have a new one?

Comment 51 Daniel Berrangé 2011-03-08 16:37:23 UTC
New one here:

https://brewweb.devel.redhat.com/taskinfo?taskID=3163686

Comment 52 Dan Kenigsberg 2011-03-09 09:19:08 UTC
Created attachment 483130 [details]
libvirt client+server logs

vdsm receives the delayed event at

libvirtEventLoop::DEBUG::2011-03-09 09:59:21,135::libvirtvm::1043::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`57701d41-e50a-4d73-a495-61ea4c0e0a84`::event St
opped detail 1 opaque None

Comment 53 Daniel Berrangé 2011-03-09 14:41:17 UTC
The log attached in Comment #52 appears to be identical to the one in Comment #49. Did you upload the wrong file ?

Comment 54 Dan Kenigsberg 2011-03-09 16:42:37 UTC
Created attachment 483258 [details]
libvirt client+server logs

(In reply to comment #53)
> The log attached in Comment #52 appears to be identical to the one in Comment
> #49. Did you upload the wrong file ?

of course. grr.

Comment 55 Daniel Berrangé 2011-03-09 16:53:22 UTC
What exactly leads you to suspect that the events are still delayed? The latest logs show that the client side is now operating correctly, and immediately passing the events up to VDSM:

09:59:21.134: 28012: debug : processCallDispatchMessage:10000 : Event ready for queue 0x7fb004228ec0 0x7fb004012610
09:59:21.134: 27327: debug : remoteDomainEventQueueFlush:10668 : Event queue flush 0x7fb004012610
09:59:21.134: 27327: debug : remoteDomainEventDispatchFunc:10654 : Dispatch event 0x7fb004228ec0 0x7fb004012610


The first line shows the event having been read off the wire and placed on the dispatch queue. The second & third lines show the event loop processing the queue and sending the event onto VDSM.  The VDSM log file timestamp shows it getting it 1 millisecond later.

So I'm not clear what bug we're supposed to be seeing here ?

Comment 56 Dan Kenigsberg 2011-03-09 17:10:46 UTC
Which event is dispatched in 09:59:21.134? I suspect VIR_DOMAIN_EVENT_STARTED

09:59:21.134: 28012: debug : virDomainCreateXML:2013

but a microsecond later vdsm is notified about a much older VIR_DOMAIN_EVENT_STOPPED. VIR_DOMAIN_EVENT_STARTED is received much later, at 09:59:21,566.

Your analysis narrows our problem down to the python-side event loop, doesn't it? How would you suggest debugging it?

Comment 57 Daniel Berrangé 2011-03-10 16:22:54 UTC
The following extract of the client.log shows a high level sequence of events

09:59:17.470: 27680: debug : virDomainCreateXML:2013
09:59:17.970: 27680: debug : processCallDispatchMessage:10000
09:59:17.970: 27327: debug : remoteDomainEventDispatchFunc:10654
09:59:18.800: 27762: debug : virDomainDestroy:2251
09:59:18.991: 27327: debug : processCallDispatchMessage:10000
09:59:18.991: 27327: debug : remoteDomainEventDispatchFunc:10654
09:59:19.516: 27845: debug : virDomainCreateXML:2013
09:59:19.942: 27845: debug : processCallDispatchMessage:10000
09:59:19.942: 27327: debug : remoteDomainEventDispatchFunc:10654
09:59:20.644: 27927: debug : virDomainDestroy:2251
09:59:21.134: 28012: debug : virDomainCreateXML:2013
09:59:21.134: 28012: debug : processCallDispatchMessage:10000
09:59:21.134: 27327: debug : remoteDomainEventDispatchFunc:10654
09:59:21.566: 28012: debug : processCallDispatchMessage:10000
09:59:21.566: 27327: debug : remoteDomainEventDispatchFunc:10654
09:59:21.568: 28012: debug : virDomainDestroy:2251
09:59:21.727: 28012: debug : processCallDispatchMessage:10000
09:59:21.727: 27327: debug : remoteDomainEventDispatchFunc:10654

So, superficially it looks like there was an out-of-order / delayed event about halfway through. I.e., IIUC, at timestamps 09:59:20.644 / 09:59:21.134 you are expecting the 'destroy' event to have arrived & been processed before virDomainCreateXML runs.

This part of the logs shows an absolute maximum of 490ms delay.  If we look more closely at this region of the logs we can get better granularity on that delay.

09:59:20.644: 27927: debug : virDomainDestroy:2251 : dom=0x7faffc081560, (VM: name=vm1, uuid=57701d41-e50a-4d73-a495-61ea4c0e0a84), 
09:59:20.808: 27927: debug : remoteIODecodeMessageLength:9802 : Got length, now need 56 total (52 more)
09:59:21.134: 28012: debug : virDomainCreateXML:2013 : conn=0x7fb004012610, xmlDesc=<?xml version="1.0" encoding="utf-8"?>
09:59:21.134: 28012: debug : remoteIODecodeMessageLength:9802 : Got length, now need 64 total (60 more)
09:59:21.134: 28012: debug : processCallDispatchMessage:10000 : Event ready for queue 0x7fb004228ec0 0x7fb004012610
09:59:21.134: 27327: debug : remoteDomainEventQueueFlush:10668 : Event queue flush 0x7fb004012610
09:59:21.566: 28012: debug : remoteIODecodeMessageLength:9802 : Got length, now need 64 total (60 more)
09:59:21.566: 28012: debug : processCallDispatchMessage:10000 : Event ready for queue 0x7fb0042405d0 0x7fb004012610
09:59:21.566: 27327: debug : remoteDomainEventQueueFlush:10668 : Event queue flush 0x7fb004012610
09:59:21.566: 28012: debug : remoteIODecodeMessageLength:9802 : Got length, now need 84 total (80 more)

The virDomainDestroy starts at 09:59:20.644, but does not complete until at least 09:59:20.808. The event from the destroy API call in fact arrives the very moment we have finished sending the virDomainCreateXML call - before its reply appears. So we can see the actual delay on the event is at most 326ms.

It is important to remember that the events are asynchronous, and dispatched from the event loop in an idle context. So given a pair of API calls: 

  virDomainDestroy
  virDomainCreateXML

On the wire, it is possible for any of the following sequences to occur

  > domain_destroy call
  < domain_destroy reply
  < domain_destroy event
  > domain_create call
  < domain_create reply
  

  > domain_destroy call
  < domain_destroy event
  < domain_destroy reply
  > domain_create call
  < domain_create reply


  > domain_destroy call
  < domain_destroy reply
  > domain_create call
  < domain_destroy event
  < domain_create reply

Assuming the destroy API call has completed on the client before the create API call is started, the only guarantee is that the 'destroy' event will arrive on the wire *prior* to the create API *reply* or create *event*.

In the majority of the client.log from comment 54, either sequence 1 or 2 occurs. This is what you should expect most of the time. The one apparently out-of-order event is an instance of sequence 3 occurring, and thus not in fact an out-of-order delivery at all.

The fact that the maximum delay we are seeing for this one event is 300ms means I'm not convinced there is any bug here. This delay is short enough that it can be explained by a combination of non-deterministic OS level thread scheduling and python's crap thread model. Python's thread model does not actually execute threads concurrently; each thread runs for 100 python instructions at a time before potentially yielding to another thread. Even this is not fair, though; it is quite likely that due to OS level scheduler fun, the same python thread will run again immediately.

This discussion is of course still considering things at the wire level. I mentioned that events are dispatched to the application in the event loop thread. Thus, if an event arrives in the middle of the virDomainCreateXML call, it will be queued. Due to the non-deterministic nature of OS thread scheduling, it could also be possible for an application to not receive the event until after the virDomainCreateXML call completes. This is not the case here though - the event is arriving & being given to the app code, while virDomainCreateXML is still running.

So in summary, the logs don't demonstrate any continued problem. Certainly there is nothing like the 5 second event delays we saw prior to the earlier fixes in this bug being applied.

Comment 58 Dan Kenigsberg 2011-03-10 19:43:24 UTC
Created attachment 483547 [details]
libvirt client+server logs 10 sec delay

> Certainly there is nothing like the 5 second event delays we saw
> prior to the earlier fixes in this bug being applied.

I am afraid that I see a 10 second delay when I add a 10 second sleep between destroy and the ensuing create. On 21:19:05.409 vdsm sees the STOP event that was generated by virDomainDestroy on 21:19:05.822.


> This delay is short enough that it can be explained by a combination of
> non-deterministic OS level thread scheduling, and  python's crap thread model

That's a second-order problem that we'd want to fix (I'd like to somehow set a barrier to make sure that all previously-issued events are dispatched. How would I do that?)
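
One hypothetical way to build such a barrier (not an existing libvirt API; whether everything queued before the timer is guaranteed to be dispatched first depends on the event loop implementation, so treat this purely as an illustration): schedule a one-shot zero-millisecond timer on the event loop and wait for it from the calling thread.

import threading
import libvirt

def event_barrier(timeout=5.0):
    # Blocks the caller until the event-loop thread has run the timer callback,
    # i.e. has cycled at least once since the barrier was requested.
    fired = threading.Event()
    timer = libvirt.virEventAddTimeout(0, lambda t, opaque: fired.set(), None)
    ok = fired.wait(timeout)
    libvirt.virEventRemoveTimeout(timer)
    return ok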

Comment 59 Dan Kenigsberg 2011-03-10 19:50:03 UTC
= correction =
On 21:19:05.409 vdsm sees the STOP event that was generated by virDomainDestroy on 21:18:54.911 (believe me, vdsm has no time machine submodule)

Comment 60 Daniel Berrangé 2011-03-11 11:35:11 UTC
Ok, with it showing a 10 second delay, that's clearly a bug. 

Can you enable debugging in the libvirt event loop impl used in VDSM code. ie

  eventLoop = virEventLoopPure(debug=True)

Also in libvirtd.conf, change 

  log_filters="1:qemu 3:json 1:util 1:security"

To

  log_filters="1:qemu 3:json 1:util 1:security 1:libvirt 1:remote"

and restart libvirtd & then capture the same set of logs as with comment #58.

Comment 61 Dan Kenigsberg 2011-03-13 11:16:09 UTC
Created attachment 483984 [details]
libvirt client+server logs, 10 sec delay, libvirtev debug

13:05:15.071 destroy
13:05:26.500 STOP event processed

libvirtev debug lines are embedded in vdsm.log

Comment 62 Daniel Berrangé 2011-03-14 11:04:26 UTC
Well those logs were quite useful in eliminating possibilities, but I still can't find the precise bug.

Could you try one thing, though: edit libvirtd.conf and change the 'auth_unix_rw' setting from 'sasl' to 'none', then restart libvirtd+vdsm. This will help us see if the problem is related to the way I/O works with SASL, or if it's a more general libvirtd I/O problem.

Comment 63 Daniel Berrangé 2011-03-14 11:43:15 UTC
I've identified one potential problem, and I think turning off SASL auth would solve it, so it would be good to have confirmation of that.

Comment 64 Dan Kenigsberg 2011-03-14 14:16:45 UTC
(In reply to comment #63)
> I've identified one potential problem, and I think turning off SASL auth would
> solve it, so it would be good to have confirmation of that.

Indeed, with
   auth_unix_rw="none"
I cannot reproduce the bug.

Comment 65 Daniel Berrangé 2011-03-14 17:43:04 UTC
I have produced another scratch build with a possible fix

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3177151

can you install that RPM, re-enable SASL authentication in libvirtd, restart libvirtd & vdsm and see if it now works correctly.

Comment 66 Dan Kenigsberg 2011-03-15 07:07:09 UTC
(In reply to comment #65)
> I have produced another scratch build with a possible fix
> 
> http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3177151
> 
> can you install that RPM, re-enable SASL authentication in libvirtd, restart
> libvirtd & vdsm and see if it now works correctly.

Looking good!

Comment 70 Vivian Bian 2011-03-24 08:36:23 UTC
checked with 
libvirt-0.8.7-14.el6.x86_64
qemu-kvm-0.12.1.2-2.152.el6.x86_64
kernel-2.6.32-125.el6.x86_64
vdsm-4.9-55.el6.x86_64

[Steps]
1. Set the following in /etc/libvirt/libvirtd.conf:

  log_filters="1:qemu 3:json 1:util 1:security"
  auth_unix_rw="sasl" # by vdsm

2. Set the following in /etc/init.d/vdsmd:

    LIBVIRT_LOG_FILTERS="1:qemu 3:json 1:util 1:security 3:json"  \
    LIBVIRT_LOG_OUTPUTS="1:file:/var/log/libvirt/vdsm.log" \
    LIBVIRT_DEBUG=1 \

3. service libvirtd restart
4. service vdsmd restart
5. ut/ruthAgent.py restart
6. ./ruth.py -l d -d /root/0.8.7-14.el6/vmTests.conf -f vmTests.SingleVmTests.repeatedCreation

 ./ruth.py -l d -d /root/0.8.7-14.el6/vmTests.conf -f vmTests.SingleVmTests.repeatedCreation

    Hello, nice to meet you. I am RUTH - "Regression and Unit Test Harness".
    I am going to run a comprehensive test suite in order to validate vdsm
    functionality. However, I may require some assistance from you in order
    to correctly bootstrap the whole procedure.
    Use --help to see what you can do with me.
    
Validating configuration file 'vmTests.conf'.
Loading RUTH configuration.
Running tests from modules: vmTests.
Module 'vmTests' is READY
Exercising module 'vmTests'
	vmTests.SingleVmTests.repeatedCreation: 
		00:21:24 DEBUG   Loading environment data
		00:21:24 INFO    Starting to build environment
		00:21:24 DEBUG   Connecting host 'client:http://10.66.85.203:54321'
		00:21:30 DEBUG   Making sure host 'client' is clean
		00:21:30 DEBUG   Validate connecting host 'client' to storage
		00:21:31 DEBUG   Connecting host 'client' to storage
		00:21:32 DEBUG   Creating domains for pool 'spUUID'
		00:21:32 DEBUG   Creating domain target for domain 'sdUUID' on '10.66.83.206:/data/ruth'
		00:21:32 DEBUG   Creating storage domain 'sdUUID:e81baa94-426f-47c8-9dfd-db6edb13ee95' version 0
		00:21:35 DEBUG   Creating pool 'spUUID:91bb6331-8b39-46cf-becf-800b504884c9' with master domain 'e81baa94-426f-47c8-9dfd-db6edb13ee95'
		00:21:41 DEBUG   Connecting pool 'spUUID'
		00:21:43 DEBUG   Starting SPM for pool 'spUUID'
		00:21:46 DEBUG   Activating domain 'sdUUID'
		00:21:49 DEBUG   Creating image 'imgUUID'
		00:21:49 DEBUG   Creating volume 'volUUID' of image 'c6ad17d3-381e-4d02-9930-6f5b6ff81c24'
		00:21:51 DEBUG   Creating image 'memImgUUID'
		00:21:51 DEBUG   Creating volume 'memVolUUID' of image 'b3a18104-d508-4f2b-9aed-c53a57f1390d'
		00:21:52 DEBUG   Creating image 'confImgUUID'
		00:21:52 DEBUG   Creating volume 'confVolUUID' of image 'de414891-5436-469f-b401-58c7aa3aa64d'
		00:21:54 DEBUG   Finished processing host 'client'
		00:21:54 INFO    Finished building environment
		00:21:55 DEBUG   Vm eda30a2a-d819-46cb-bb5c-2975ab8f4166 is Powering up
-------- Powering up
		00:21:55 DEBUG   Vm eda30a2a-d819-46cb-bb5c-2975ab8f4166 is Up
		00:21:56 DEBUG   Vm eda30a2a-d819-46cb-bb5c-2975ab8f4166 is Powering up
-------- Powering up
		00:21:56 DEBUG   Vm eda30a2a-d819-46cb-bb5c-2975ab8f4166 is Up
		00:21:58 DEBUG   Vm eda30a2a-d819-46cb-bb5c-2975ab8f4166 is Powering up
-------- Powering up
		00:21:58 DEBUG   Vm eda30a2a-d819-46cb-bb5c-2975ab8f4166 is Up
		00:21:59 DEBUG   Vm eda30a2a-d819-46cb-bb5c-2975ab8f4166 is Powering up
-------- Powering up
		00:21:59 DEBUG   Vm eda30a2a-d819-46cb-bb5c-2975ab8f4166 is Up
		00:22:01 DEBUG   Vm eda30a2a-d819-46cb-bb5c-2975ab8f4166 is Powering up
-------- Powering up
		00:22:01 DEBUG   Vm eda30a2a-d819-46cb-bb5c-2975ab8f4166 is Up
		00:22:02 DEBUG   Vm eda30a2a-d819-46cb-bb5c-2975ab8f4166 is Powering up
-------- Powering up
		00:22:02 DEBUG   Vm eda30a2a-d819-46cb-bb5c-2975ab8f4166 is Up
		00:22:04 DEBUG   Vm eda30a2a-d819-46cb-bb5c-2975ab8f4166 is Powering up
-------- Powering up
		00:22:04 DEBUG   Vm eda30a2a-d819-46cb-bb5c-2975ab8f4166 is Up
		00:22:05 DEBUG   Vm eda30a2a-d819-46cb-bb5c-2975ab8f4166 is Powering up
-------- Powering up
		00:22:05 DEBUG   Vm eda30a2a-d819-46cb-bb5c-2975ab8f4166 is Up
		00:22:07 DEBUG   Vm eda30a2a-d819-46cb-bb5c-2975ab8f4166 is Powering up
-------- Powering up
		00:22:07 DEBUG   Vm eda30a2a-d819-46cb-bb5c-2975ab8f4166 is Up
		00:22:08 DEBUG   Vm eda30a2a-d819-46cb-bb5c-2975ab8f4166 is Powering up
-------- Powering up
		00:22:08 DEBUG   Vm eda30a2a-d819-46cb-bb5c-2975ab8f4166 is Up
		00:22:10 DEBUG   Vm eda30a2a-d819-46cb-bb5c-2975ab8f4166 is Powering up
-------- Powering up
		00:22:10 DEBUG   Vm eda30a2a-d819-46cb-bb5c-2975ab8f4166 is Up
		00:22:10 INFO    Starting clean up
		00:22:10 DEBUG   Releasing pools
		00:22:13 DEBUG   Cleaning pool '91bb6331-8b39-46cf-becf-800b504884c9'
		00:22:13 DEBUG   Deactivating domain 'e81baa94-426f-47c8-9dfd-db6edb13ee95'
		00:22:19 DEBUG   Deleting domains
		00:22:19 DEBUG   Deleting domain 'e81baa94-426f-47c8-9dfd-db6edb13ee95'
		00:22:19 DEBUG   Formatting domain 'e81baa94-426f-47c8-9dfd-db6edb13ee95'
		00:22:20 DEBUG   Destroying domain targets
		00:22:20 DEBUG   Disconnecting from storage
		00:22:20 INFO    Finished clean up
	Result: OK
*************
* All Good! *
*************
All Done!
remember:
	"To Err is Human, To Test is Divine!"


All good for vmTests.SingleVmTests.repeatedCreation, so setting the bug status to VERIFIED.

Comment 71 Nan Zhang 2011-04-20 11:08:42 UTC
Running RUTH test with the following builds, the result is OK. So moving to VERIFIED.

qemu-kvm-0.12.1.2-2.153.el6.x86_64
libvirt-0.8.7-18.el6.x86_64
vdsm-4.9-60.el6.x86_64


vmTests.SingleVmTests.repeatedCreation: OK

Comment 74 errata-xmlrpc 2011-05-19 13:20:08 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0596.html

