Bug 743817
Summary: | [libvirt][events] sometimes after migrating a large number of domains, Events does not arrive to the client side. | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | David Naori <dnaori> | ||||
Component: | libvirt | Assignee: | Michal Privoznik <mprivozn> | ||||
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> | ||||
Severity: | high | Docs Contact: | |||||
Priority: | high | ||||||
Version: | 6.4 | CC: | abaron, acathrow, ajia, dallan, danken, dnaori, gren, mgoldboi, mzhan, rwu, syeghiay, veillard, weizhan, ykaul, yupzhang | ||||
Target Milestone: | rc | Keywords: | Regression | ||||
Target Release: | --- | ||||||
Hardware: | x86_64 | ||||||
OS: | Linux | ||||||
Whiteboard: | |||||||
Fixed In Version: | libvirt-0.9.4-18.el6 | Doc Type: | Bug Fix | ||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | Environment: | ||||||
Last Closed: | 2011-12-06 11:35:04 UTC | Type: | --- | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | |||||||
Bug Blocks: | 748554 | ||||||
Attachments: |
|
David, I've created a special build for you https://brewweb.devel.redhat.com/taskinfo?taskID=3690802 It contains a patch which I think it's proper fix. Could you please give it a try? (In reply to comment #2) > David, > > I've created a special build for you > https://brewweb.devel.redhat.com/taskinfo?taskID=3690802 > > It contains a patch which I think it's proper fix. Could you please give it a > try? Tested with the patch- libvirtd crashed during multiple migrations. #0 __pthread_mutex_lock (mutex=0x0) at pthread_mutex_lock.c:50 #1 0x000000000047709a in qemuDomainObjEnterMonitorInternal (driver=0x7f273c00cb70, driver_locked=true, obj=0x7f273c108f40, asyncJob=QEMU_ASYNC_JOB_NONE) at qemu/qemu_domain.c:917 #2 0x000000000044ef4d in qemuDomainMonitorCommand (domain=<value optimized out>, cmd=0x7f26a8002a80 "{\"execute\": \"query-blockstats\"}", result=0x7f26a8002c60, flags=0) at qemu/qemu_driver.c:9769 #3 0x00007f274e1a3af9 in virDomainQemuMonitorCommand (domain=0x7f26a8000f60, cmd=0x7f26a8002a80 "{\"execute\": \"query-blockstats\"}", result=0x7f26a8002c60, flags=0) at libvirt-qemu.c:102 #4 0x000000000041fe64 in qemuDispatchMonitorCommand (server=<value optimized out>, client=<value optimized out>, hdr=<value optimized out>, rerr=0x7f274752dc30, args=<value optimized out>, ret=0x7f26a8002c60) at remote.c:2772 #5 qemuDispatchMonitorCommandHelper (server=<value optimized out>, client=<value optimized out>, hdr=<value optimized out>, rerr=0x7f274752dc30, args=<value optimized out>, ret=0x7f26a8002c60) at qemu_dispatch.h:72 #6 0x000000000043dc1e in virNetServerProgramDispatchCall (prog=0x1fa8e00, server=0x1f9df70, client=0x1facdc0, msg=0x2228b20) at rpc/virnetserverprogram.c:401 #7 virNetServerProgramDispatch (prog=0x1fa8e00, server=0x1f9df70, client=0x1facdc0, msg=0x2228b20) at rpc/virnetserverprogram.c:287 #8 0x000000000044066c in virNetServerHandleJob (jobOpaque=<value optimized out>, opaque=0x1f9df70) at rpc/virnetserver.c:136 #9 0x00007f274de3335c in virThreadPoolWorker (opaque=<value optimized out>) at util/threadpool.c:144 #10 0x00007f274de32c72 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:157 #11 0x00000031dba077f1 in start_thread (arg=0x7f274752e700) at pthread_create.c:301 #12 0x00000031db2e570d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Yeah, but that's a separate issue, since in first case we fail dispatching events on client, but here we crash on locking NULL. So I'll send patch upstream as it seems the current issue is fixed with it. Just for the record, David helped me to reproduce and what I've found out: Events are transferred to the client, but client somehow disables timer for dispatching them. My rough guess is, in virDomainEventStateQueue() between actual even queuing and check for queue size, we can insert another event (from different thread) as this function rely on lock from callers (who doesn't need to share the same lock). Patch sent upstream: https://www.redhat.com/archives/libvir-list/2011-October/msg00309.html David, I've created another patch for that different issue and thus new scratch build: https://brewweb.devel.redhat.com/taskinfo?taskID=3699039 Please, give it a try. (In reply to comment #7) > David, > > I've created another patch for that different issue and thus new scratch build: > > https://brewweb.devel.redhat.com/taskinfo?taskID=3699039 > > Please, give it a try. I was able to reproduce the bug with this build. David, the last build: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3705666 The first tried to fix issue, but was not complete. The second build fixed irrelevant issue (see https://bugzilla.redhat.com/show_bug.cgi?id=745734) This build contains complete fix for that. Although still waiting for David confirmation, moving to POST, as the patch is definitely related and needed: http://post-office.corp.redhat.com/archives/rhvirt-patches/2011-October/msg00472.html (In reply to comment #11) > David, > > the last build: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3705666 > > The first tried to fix issue, but was not complete. > The second build fixed irrelevant issue (see > https://bugzilla.redhat.com/show_bug.cgi?id=745734) > This build contains complete fix for that. Could not reproduce the problem with this build, looks like a good fix. (In reply to comment #13) > (In reply to comment #11) > > David, > > > > the last build: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3705666 > > > > The first tried to fix issue, but was not complete. > > The second build fixed irrelevant issue (see > > https://bugzilla.redhat.com/show_bug.cgi?id=745734) > > This build contains complete fix for that. > > Could not reproduce the problem with this build, looks like a good fix. Hi David As we can not reproduce this issue,can you please help to test this issue with latest libvirt build libvirt-0.9.4-18.el6 ? Link: https://brewweb.devel.redhat.com/buildinfo?buildID=183686 Thanks very much. Can not reproduce on qemu-kvm-0.12.1.2-2.199.el6.x86_64 kernel-2.6.32-207.el6.x86_64 libvirt-0.9.4-18.el6.x86_64 test concurrent 60 vms migration 10 times with command # for i in {1..60}; do virsh migrate --live test$i qemu+tcp://10.66.82.125/system & done using python /usr/share/doc/libvirt-python-0.9.4/events-python/event-test.py to get event During migration, libvirt python can get the event After 10 times migration, start a guest, the boot event also can be get Tested on libvirt-0.9.4-19.el6.x86_64. Ran bidirectional migration of 100 vm between 2 host repeatedly. Followed vdsm & libvirtd logs real-time to see that vdsm is getting every Relayed event by libvirtd and without major Delays. => Verified. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2011-1513.html |
Created attachment 526641 [details] libvirtd.log libvirt.log vdsm.log Description of problem: Migrated ~50 domains via vdsm, the migrations finish up successfully, but the client side after getting a large number of events (VIR_DOMAIN_EVENT_RESUMED_MIGRATED, VIR_DOMAIN_EVENT_STARTED_MIGRATED) it stopped getting **any** events from libvirtd. (vdsm reporting paused vms while actually up) libvirtd Relaying events (libvirtd.log): 19:04:24.478: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1 19:04:26.520: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1 19:04:26.536: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1 19:04:26.736: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1 19:04:28.022: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1 19:04:30.019: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1 19:04:37.952: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1 19:04:39.025: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1 19:04:39.474: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1 19:04:40.145: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1 19:04:40.368: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1 19:04:41.058: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1 19:04:41.514: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1 19:04:41.733: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1 19:04:45.171: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1 19:04:47.162: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1 19:04:49.182: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1 19:04:50.439: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1 19:04:51.963: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1 19:04:53.523: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1 19:04:53.629: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1 19:04:55.015: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1 19:04:56.322: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1 19:04:57.729: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1 19:04:57.961: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1 19:05:00.062: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1 19:05:00.289: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1 19:05:01.840: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1 19:05:02.438: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1 19:05:03.776: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1 19:05:06.477: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1 19:05:06.529: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1 19:05:08.361: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1 19:05:08.558: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1 19:05:08.659: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1 19:05:10.288: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1 19:05:13.300: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1 19:05:13.924: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1 19:05:15.154: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1 19:05:15.360: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1 19:05:16.852: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1 19:05:17.722: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1 19:05:19.420: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1 19:05:19.626: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 2 1 19:05:22.763: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1 19:05:24.770: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1 19:05:26.969: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1 19:05:27.793: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 4 1 From libvirt.log (client side): 19:04:43 is the last Event logged in the client side. 19:04:33.438: 8992: debug : remoteDomainEventDispatchFunc:4071 : Dispatch event 0x7faecc1e1bc0 0x7faee4033f20 19:04:33.438: 8992: debug : remoteDomainEventQueueFlush:4084 : Event queue flush 0x7faee4033f20 19:04:33.439: 8992: debug : remoteDomainEventDispatchFunc:4071 : Dispatch event 0x7faf1c004070 0x7faee4033f20 19:04:33.439: 8992: debug : remoteDomainEventDispatchFunc:4071 : Dispatch event 0x7faf1c007ed0 0x7faee4033f20 19:04:33.442: 8992: debug : remoteDomainEventDispatchFunc:4071 : Dispatch event 0x7faf1c008840 0x7faee4033f20 19:04:33.445: 8992: debug : remoteDomainEventDispatchFunc:4071 : Dispatch event 0x7fae44184430 0x7faee4033f20 19:04:33.451: 8992: debug : remoteDomainEventDispatchFunc:4071 : Dispatch event 0x7faf1c0195c0 0x7faee4033f20 19:04:37.953: 8992: debug : remoteDomainEventQueueFlush:4084 : Event queue flush 0x7faee4033f20 19:04:37.953: 8992: debug : remoteDomainEventDispatchFunc:4071 : Dispatch event 0x7fadf8053770 0x7faee4033f20 19:04:39.027: 8992: debug : remoteDomainEventQueueFlush:4084 : Event queue flush 0x7faee4033f20 19:04:39.027: 8992: debug : remoteDomainEventDispatchFunc:4071 : Dispatch event 0x7fadf8064150 0x7faee4033f20 19:04:39.476: 8992: debug : remoteDomainEventQueueFlush:4084 : Event queue flush 0x7faee4033f20 19:04:39.476: 8992: debug : remoteDomainEventDispatchFunc:4071 : Dispatch event 0x7fae0800ac80 0x7faee4033f20 19:04:40.148: 8992: debug : remoteDomainEventQueueFlush:4084 : Event queue flush 0x7faee4033f20 19:04:40.149: 8992: debug : remoteDomainEventDispatchFunc:4071 : Dispatch event 0x7fae0800ac80 0x7faee4033f20 19:04:40.369: 8992: debug : remoteDomainEventQueueFlush:4084 : Event queue flush 0x7faee4033f20 19:04:40.369: 8992: debug : remoteDomainEventDispatchFunc:4071 : Dispatch event 0x7faf1c006d30 0x7faee4033f20 19:04:41.059: 8992: debug : remoteDomainEventQueueFlush:4084 : Event queue flush 0x7faee4033f20 19:04:41.059: 8992: debug : remoteDomainEventDispatchFunc:4071 : Dispatch event 0x7faee8004390 0x7faee4033f20 19:04:43.644: 8992: debug : remoteDomainEventQueueFlush:4084 : Event queue flush 0x7faee4033f20 19:04:43.646: 8992: debug : remoteDomainEventDispatchFunc:4071 : Dispatch event 0x7faee8003f60 0x7faee4033f20 from 19:04:43 the client side stopped getting any events from libvirtd. you can see a day after, killed all domains: 09:39:55.076: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 5 5 ..... 09:39:57.043: 8856: debug : remoteRelayDomainEventLifecycle:129 : Relaying domain lifecycle event 5 5 Libvirtd is relaying the VIR_DOMAIN_EVENT_STOPPED_FAILED, and the client side doesn't get anything. # virsh -r list Id Name State ---------------------------------- # vdsClient -s 0 list table 87ab83f6-787f-48ce-a229-68b06231a75b 29822 NFS-LIB-TEST-055 Up 10.35.110.65 10.35.110.66 c29270be-82fb-449f-8820-4efd8e6f3104 27488 NFS-LIB-TEST-027 Up 10.35.110.125 10.35.110.127 9e883c14-9a64-441b-a90e-efd6ac0554b8 29554 NFS-LIB-TEST-031 Paused 10.35.110.122 10.35.110.124 65b72786-f4c6-4ce0-8f9e-bff669199e6d 27845 NFS-LIB-TEST-077 Up 10.35.110.75 10.35.110.78 f52cd5db-1adb-4530-9e37-0c149e6f34b5 30006 NFS-LIB-TEST-053 Up 10.35.102.248 10.35.102.250 1b13b1ac-6839-46b4-940b-b3a25cec4949 28568 NFS-LIB-TEST-017 Up 10.35.110.55 10.35.110.56 f09a3ee7-2e46-4040-b614-20085687fb78 30275 NFS-LIB-TEST-059 Up 10.35.110.100 10.35.110.102 3473d29c-16f9-4bd4-b64a-f173d82bff76 31409 NFS-LIB-TEST-007 Up 10.35.107.15 10.35.108.41 a609803e-1744-4bb1-88e9-28e6698d3f46 27919 NFS-LIB-TEST-024 Up 10.35.110.37 10.35.110.39 55bac48c-5f6c-4a70-a2c4-b48c77ec23be 31044 NFS-LIB-TEST-070 Up 10.35.110.76 10.35.110.79 ec42a42b-5873-4bf9-b152-44d1f3045a50 30746 NFS-LIB-TEST-074 Up 10.35.110.128 10.35.110.130 c26cec8b-3622-4efd-a01a-401e4a46a18d 31666 NFS-LIB-TEST-096 Up 10.35.110.114 10.35.110.115 0b608344-d206-46fa-8a8f-6e766b0d450b 31765 NFS-LIB-TEST-071 Up 10.35.110.129 10.35.110.131 37278fe2-6532-4c59-86e1-4119087cf8b5 30859 NFS-LIB-TEST-086 Up 10.35.110.101 10.35.110.105 2bd1e956-f09b-4ab0-ba37-feba6c54a378 29661 NFS-LIB-TEST-072 Paused 10.35.110.77 10.35.110.81 c159e6ef-3b09-40bc-8d49-6b09e0ea0ddc 30637 NFS-LIB-TEST-093 Up 10.35.110.135 10.35.110.136 c8e4cef8-475c-42c7-8961-4987ce8a0de8 31116 NFS-LIB-TEST-051 Up 10.35.110.61 10.35.110.62 d08fe51b-ebc0-46ec-b357-92ab22a18aeb 30148 NFS-LIB-TEST-021 Up 10.35.109.19 10.35.110.23 574c1f75-8134-4ccc-9814-ff5b4397d5a7 29752 NFS-LIB-TEST-065 Paused 10.35.110.111 10.35.110.112 e3f16fcf-2efe-4a35-9c74-4bbcb334777f 27869 NFS-LIB-TEST-080 Up 10.35.110.132 10.35.110.133 c38fddf9-9d38-4813-a4a5-89fe2e3c39c8 31715 NFS-LIB-TEST-075 Up 10.35.110.89 10.35.110.90 bb4e701e-8b0a-46f1-bba6-d901e6240d5d 31226 NFS-LIB-TEST-013 Up 10.35.110.95 10.35.110.97 6b0a4987-ee04-4a53-9e45-7a3949bd9240 31514 NFS-LIB-TEST-002 Up 10.35.110.70 10.35.110.72 eb24c896-38c3-43c8-aaa2-3b39c4c6d2e3 30392 NFS-LIB-TEST-063 Up 10.35.110.41 10.35.110.44 752ec84e-ff31-4401-b021-0efb6d242fd7 30524 NFS-LIB-TEST-091 Up 10.35.104.10 10.35.107.13 Version-Release number of selected component (if applicable): libvirt-0.9.4-14.el6.x86_64 How reproducible: 20% Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info: attahced libvirtd.log, libvirt.log (client side), vdsm.log.