RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Bug 743817 - [libvirt][events] sometimes after migrating a large number of domains, Events does not arrive to the client side.
Summary: [libvirt][events] sometimes after migrating a large number of domains, Events...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: libvirt
Version: 6.4
Hardware: x86_64
OS: Linux
high
high
Target Milestone: rc
: ---
Assignee: Michal Privoznik
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: 748554
TreeView+ depends on / blocked
 
Reported: 2011-10-06 07:52 UTC by David Naori
Modified: 2011-12-06 11:35 UTC (History)
15 users (show)

Fixed In Version: libvirt-0.9.4-18.el6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-12-06 11:35:04 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
libvirtd.log libvirt.log vdsm.log (2.04 MB, application/x-xz)
2011-10-06 07:52 UTC, David Naori
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2011:1513 0 normal SHIPPED_LIVE libvirt bug fix and enhancement update 2011-12-06 01:23:30 UTC

Description David Naori 2011-10-06 07:52:18 UTC
Created attachment 526641 [details]
libvirtd.log libvirt.log vdsm.log

Description of problem:

Migrated ~50 domains via vdsm, the migrations finish up successfully, but the client side after getting a large number of events (VIR_DOMAIN_EVENT_RESUMED_MIGRATED, VIR_DOMAIN_EVENT_STARTED_MIGRATED) it stopped getting **any** events from libvirtd.

(vdsm reporting paused vms while actually up)

libvirtd Relaying events (libvirtd.log):
 
19:04:24.478: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1
19:04:26.520: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1
19:04:26.536: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1
19:04:26.736: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1
19:04:28.022: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1
19:04:30.019: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1
19:04:37.952: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1
19:04:39.025: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1
19:04:39.474: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1
19:04:40.145: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1
19:04:40.368: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1
19:04:41.058: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1
19:04:41.514: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1
19:04:41.733: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1
19:04:45.171: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1
19:04:47.162: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1
19:04:49.182: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1
19:04:50.439: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1
19:04:51.963: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1
19:04:53.523: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1
19:04:53.629: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1
19:04:55.015: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1
19:04:56.322: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1
19:04:57.729: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1
19:04:57.961: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1
19:05:00.062: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1
19:05:00.289: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1
19:05:01.840: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1
19:05:02.438: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1
19:05:03.776: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1
19:05:06.477: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1
19:05:06.529: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1
19:05:08.361: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1
19:05:08.558: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1
19:05:08.659: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1
19:05:10.288: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1
19:05:13.300: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1
19:05:13.924: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1
19:05:15.154: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1
19:05:15.360: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1
19:05:16.852: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1
19:05:17.722: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1
19:05:19.420: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1
19:05:19.626: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1
19:05:22.763: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1
19:05:24.770: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1
19:05:26.969: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1
19:05:27.793: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1


From libvirt.log (client side): 19:04:43 is the last Event logged in the client side. 

19:04:33.438: 8992: debug : remoteDomainEventDispatchFunc:4071 : Dispatch event 0x7faecc1e1bc0 0x7faee4033f20
19:04:33.438: 8992: debug : remoteDomainEventQueueFlush:4084 : Event queue flush 0x7faee4033f20
19:04:33.439: 8992: debug : remoteDomainEventDispatchFunc:4071 : Dispatch event 0x7faf1c004070 0x7faee4033f20
19:04:33.439: 8992: debug : remoteDomainEventDispatchFunc:4071 : Dispatch event 0x7faf1c007ed0 0x7faee4033f20
19:04:33.442: 8992: debug : remoteDomainEventDispatchFunc:4071 : Dispatch event 0x7faf1c008840 0x7faee4033f20
19:04:33.445: 8992: debug : remoteDomainEventDispatchFunc:4071 : Dispatch event 0x7fae44184430 0x7faee4033f20
19:04:33.451: 8992: debug : remoteDomainEventDispatchFunc:4071 : Dispatch event 0x7faf1c0195c0 0x7faee4033f20
19:04:37.953: 8992: debug : remoteDomainEventQueueFlush:4084 : Event queue flush 0x7faee4033f20
19:04:37.953: 8992: debug : remoteDomainEventDispatchFunc:4071 : Dispatch event 0x7fadf8053770 0x7faee4033f20
19:04:39.027: 8992: debug : remoteDomainEventQueueFlush:4084 : Event queue flush 0x7faee4033f20
19:04:39.027: 8992: debug : remoteDomainEventDispatchFunc:4071 : Dispatch event 0x7fadf8064150 0x7faee4033f20
19:04:39.476: 8992: debug : remoteDomainEventQueueFlush:4084 : Event queue flush 0x7faee4033f20
19:04:39.476: 8992: debug : remoteDomainEventDispatchFunc:4071 : Dispatch event 0x7fae0800ac80 0x7faee4033f20
19:04:40.148: 8992: debug : remoteDomainEventQueueFlush:4084 : Event queue flush 0x7faee4033f20
19:04:40.149: 8992: debug : remoteDomainEventDispatchFunc:4071 : Dispatch event 0x7fae0800ac80 0x7faee4033f20
19:04:40.369: 8992: debug : remoteDomainEventQueueFlush:4084 : Event queue flush 0x7faee4033f20
19:04:40.369: 8992: debug : remoteDomainEventDispatchFunc:4071 : Dispatch event 0x7faf1c006d30 0x7faee4033f20
19:04:41.059: 8992: debug : remoteDomainEventQueueFlush:4084 : Event queue flush 0x7faee4033f20
19:04:41.059: 8992: debug : remoteDomainEventDispatchFunc:4071 : Dispatch event 0x7faee8004390 0x7faee4033f20
19:04:43.644: 8992: debug : remoteDomainEventQueueFlush:4084 : Event queue flush 0x7faee4033f20
19:04:43.646: 8992: debug : remoteDomainEventDispatchFunc:4071 : Dispatch event 0x7faee8003f60 0x7faee4033f20

from 19:04:43 the client side stopped getting any events from libvirtd. 

you can see a day after, killed all domains:
09:39:55.076: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 5 5
.....
09:39:57.043: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 5 5

Libvirtd is relaying the VIR_DOMAIN_EVENT_STOPPED_FAILED, and the client side doesn't get anything.

# virsh -r list
 Id Name                 State
----------------------------------

# vdsClient -s 0 list table
87ab83f6-787f-48ce-a229-68b06231a75b  29822  NFS-LIB-TEST-055     Up                   10.35.110.65 10.35.110.66
c29270be-82fb-449f-8820-4efd8e6f3104  27488  NFS-LIB-TEST-027     Up                   10.35.110.125 10.35.110.127
9e883c14-9a64-441b-a90e-efd6ac0554b8  29554  NFS-LIB-TEST-031     Paused               10.35.110.122 10.35.110.124
65b72786-f4c6-4ce0-8f9e-bff669199e6d  27845  NFS-LIB-TEST-077     Up                   10.35.110.75 10.35.110.78
f52cd5db-1adb-4530-9e37-0c149e6f34b5  30006  NFS-LIB-TEST-053     Up                   10.35.102.248 10.35.102.250
1b13b1ac-6839-46b4-940b-b3a25cec4949  28568  NFS-LIB-TEST-017     Up                   10.35.110.55 10.35.110.56
f09a3ee7-2e46-4040-b614-20085687fb78  30275  NFS-LIB-TEST-059     Up                   10.35.110.100 10.35.110.102
3473d29c-16f9-4bd4-b64a-f173d82bff76  31409  NFS-LIB-TEST-007     Up                   10.35.107.15 10.35.108.41
a609803e-1744-4bb1-88e9-28e6698d3f46  27919  NFS-LIB-TEST-024     Up                   10.35.110.37 10.35.110.39
55bac48c-5f6c-4a70-a2c4-b48c77ec23be  31044  NFS-LIB-TEST-070     Up                   10.35.110.76 10.35.110.79
ec42a42b-5873-4bf9-b152-44d1f3045a50  30746  NFS-LIB-TEST-074     Up                   10.35.110.128 10.35.110.130
c26cec8b-3622-4efd-a01a-401e4a46a18d  31666  NFS-LIB-TEST-096     Up                   10.35.110.114 10.35.110.115
0b608344-d206-46fa-8a8f-6e766b0d450b  31765  NFS-LIB-TEST-071     Up                   10.35.110.129 10.35.110.131
37278fe2-6532-4c59-86e1-4119087cf8b5  30859  NFS-LIB-TEST-086     Up                   10.35.110.101 10.35.110.105
2bd1e956-f09b-4ab0-ba37-feba6c54a378  29661  NFS-LIB-TEST-072     Paused               10.35.110.77 10.35.110.81
c159e6ef-3b09-40bc-8d49-6b09e0ea0ddc  30637  NFS-LIB-TEST-093     Up                   10.35.110.135 10.35.110.136
c8e4cef8-475c-42c7-8961-4987ce8a0de8  31116  NFS-LIB-TEST-051     Up                   10.35.110.61 10.35.110.62
d08fe51b-ebc0-46ec-b357-92ab22a18aeb  30148  NFS-LIB-TEST-021     Up                   10.35.109.19 10.35.110.23
574c1f75-8134-4ccc-9814-ff5b4397d5a7  29752  NFS-LIB-TEST-065     Paused               10.35.110.111 10.35.110.112
e3f16fcf-2efe-4a35-9c74-4bbcb334777f  27869  NFS-LIB-TEST-080     Up                   10.35.110.132 10.35.110.133
c38fddf9-9d38-4813-a4a5-89fe2e3c39c8  31715  NFS-LIB-TEST-075     Up                   10.35.110.89 10.35.110.90
bb4e701e-8b0a-46f1-bba6-d901e6240d5d  31226  NFS-LIB-TEST-013     Up                   10.35.110.95 10.35.110.97
6b0a4987-ee04-4a53-9e45-7a3949bd9240  31514  NFS-LIB-TEST-002     Up                   10.35.110.70 10.35.110.72
eb24c896-38c3-43c8-aaa2-3b39c4c6d2e3  30392  NFS-LIB-TEST-063     Up                   10.35.110.41 10.35.110.44
752ec84e-ff31-4401-b021-0efb6d242fd7  30524  NFS-LIB-TEST-091     Up                   10.35.104.10 10.35.107.13

Version-Release number of selected component (if applicable):
libvirt-0.9.4-14.el6.x86_64

How reproducible:
20%

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:
attahced libvirtd.log, libvirt.log (client side), vdsm.log.

Comment 2 Michal Privoznik 2011-10-06 17:24:19 UTC
David,

I've created a special build for you https://brewweb.devel.redhat.com/taskinfo?taskID=3690802

It contains a patch which I think it's proper fix. Could you please give it a try?

Comment 3 David Naori 2011-10-10 10:50:00 UTC
(In reply to comment #2)
> David,
> 
> I've created a special build for you
> https://brewweb.devel.redhat.com/taskinfo?taskID=3690802
> 
> It contains a patch which I think it's proper fix. Could you please give it a
> try?

Tested with the patch- libvirtd crashed during multiple migrations.

#0  __pthread_mutex_lock (mutex=0x0) at pthread_mutex_lock.c:50
#1  0x000000000047709a in qemuDomainObjEnterMonitorInternal (driver=0x7f273c00cb70, driver_locked=true, obj=0x7f273c108f40, asyncJob=QEMU_ASYNC_JOB_NONE) at qemu/qemu_domain.c:917
#2  0x000000000044ef4d in qemuDomainMonitorCommand (domain=<value optimized out>, cmd=0x7f26a8002a80 "{\"execute\": \"query-blockstats\"}", result=0x7f26a8002c60, flags=0) at qemu/qemu_driver.c:9769
#3  0x00007f274e1a3af9 in virDomainQemuMonitorCommand (domain=0x7f26a8000f60, cmd=0x7f26a8002a80 "{\"execute\": \"query-blockstats\"}", result=0x7f26a8002c60, flags=0) at libvirt-qemu.c:102
#4  0x000000000041fe64 in qemuDispatchMonitorCommand (server=<value optimized out>, client=<value optimized out>, hdr=<value optimized out>, rerr=0x7f274752dc30, args=<value optimized out>, ret=0x7f26a8002c60) at remote.c:2772
#5  qemuDispatchMonitorCommandHelper (server=<value optimized out>, client=<value optimized out>, hdr=<value optimized out>, rerr=0x7f274752dc30, args=<value optimized out>, ret=0x7f26a8002c60) at qemu_dispatch.h:72
#6  0x000000000043dc1e in virNetServerProgramDispatchCall (prog=0x1fa8e00, server=0x1f9df70, client=0x1facdc0, msg=0x2228b20) at rpc/virnetserverprogram.c:401
#7  virNetServerProgramDispatch (prog=0x1fa8e00, server=0x1f9df70, client=0x1facdc0, msg=0x2228b20) at rpc/virnetserverprogram.c:287
#8  0x000000000044066c in virNetServerHandleJob (jobOpaque=<value optimized out>, opaque=0x1f9df70) at rpc/virnetserver.c:136
#9  0x00007f274de3335c in virThreadPoolWorker (opaque=<value optimized out>) at util/threadpool.c:144
#10 0x00007f274de32c72 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:157
#11 0x00000031dba077f1 in start_thread (arg=0x7f274752e700) at pthread_create.c:301
#12 0x00000031db2e570d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Comment 4 Michal Privoznik 2011-10-10 11:22:14 UTC
Yeah, but that's a separate issue, since in first case we fail dispatching events on client, but here we crash on locking NULL. So I'll send patch upstream as it seems the current issue is fixed with it.

Comment 5 Michal Privoznik 2011-10-10 11:45:25 UTC
Just for the record, David helped me to reproduce and what I've found out:
Events are transferred to the client, but client somehow disables timer for dispatching them. My rough guess is, in virDomainEventStateQueue() between actual even queuing and check for queue size, we can insert another event (from different thread) as this function rely on lock from callers (who doesn't need to share the same lock).

Comment 6 Michal Privoznik 2011-10-10 11:46:34 UTC
Patch sent upstream:

https://www.redhat.com/archives/libvir-list/2011-October/msg00309.html

Comment 7 Michal Privoznik 2011-10-11 09:23:49 UTC
David,

I've created another patch for that different issue and thus new scratch build:

https://brewweb.devel.redhat.com/taskinfo?taskID=3699039

Please, give it a try.

Comment 10 David Naori 2011-10-12 10:12:22 UTC
(In reply to comment #7)
> David,
> 
> I've created another patch for that different issue and thus new scratch build:
> 
> https://brewweb.devel.redhat.com/taskinfo?taskID=3699039
> 
> Please, give it a try.

I was able to reproduce the bug with this build.

Comment 11 Michal Privoznik 2011-10-13 11:37:57 UTC
David,

the last build: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3705666

The first tried to fix issue, but was not complete.
The second build fixed irrelevant issue (see https://bugzilla.redhat.com/show_bug.cgi?id=745734)
This build contains complete fix for that.

Comment 12 Michal Privoznik 2011-10-14 13:45:30 UTC
Although still waiting for David confirmation, moving to POST, as the patch is definitely related and needed:

http://post-office.corp.redhat.com/archives/rhvirt-patches/2011-October/msg00472.html

Comment 13 David Naori 2011-10-16 14:49:11 UTC
(In reply to comment #11)
> David,
> 
> the last build: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3705666
> 
> The first tried to fix issue, but was not complete.
> The second build fixed irrelevant issue (see
> https://bugzilla.redhat.com/show_bug.cgi?id=745734)
> This build contains complete fix for that.

Could not reproduce the problem with this build, looks like a good fix.

Comment 16 yuping zhang 2011-10-18 06:49:36 UTC
(In reply to comment #13)
> (In reply to comment #11)
> > David,
> > 
> > the last build: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3705666
> > 
> > The first tried to fix issue, but was not complete.
> > The second build fixed irrelevant issue (see
> > https://bugzilla.redhat.com/show_bug.cgi?id=745734)
> > This build contains complete fix for that.
> 
> Could not reproduce the problem with this build, looks like a good fix.

Hi David

As we can not reproduce this issue,can you please help to test this issue with latest libvirt build libvirt-0.9.4-18.el6 ? Link:  https://brewweb.devel.redhat.com/buildinfo?buildID=183686
Thanks very much.

Comment 18 weizhang 2011-10-19 11:13:26 UTC
Can not reproduce on 
qemu-kvm-0.12.1.2-2.199.el6.x86_64
kernel-2.6.32-207.el6.x86_64
libvirt-0.9.4-18.el6.x86_64

test concurrent 60 vms migration 10 times with command
# for i in {1..60}; do virsh migrate --live test$i qemu+tcp://10.66.82.125/system & done

using python /usr/share/doc/libvirt-python-0.9.4/events-python/event-test.py
to get event

During migration, libvirt python can get the event
After 10 times migration, start a guest, the boot event also can be get

Comment 20 David Naori 2011-10-19 11:40:45 UTC
Tested on libvirt-0.9.4-19.el6.x86_64.

Ran bidirectional migration of 100 vm between 2 host repeatedly.

Followed vdsm & libvirtd logs real-time to see that vdsm is getting every Relayed event by libvirtd and without major Delays.

=> Verified.

Comment 23 errata-xmlrpc 2011-12-06 11:35:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1513.html


Note You need to log in before you can comment on or make changes to this bug.