Description of problem:
Libvirt domain lifecycle events are not fired after a domain migration if the domain API on the target has never been invoked.

Version-Release number of selected component (if applicable):
libvirt 4.2.0

How reproducible:
100%

Steps to Reproduce:
1. On the target node, start libvirt and only watch for lifecycle events. Do not invoke anything related to the domain API.
2. Perform a live migration to the target node.
3. No domain lifecycle events are received after the migration completes, even though the domain is now defined and active on the target.

Here's where things get strange... On the target node, if the domain API is invoked to list or define domains before the migration takes place, lifecycle events are triggered. So, as a workaround for this issue, we literally have a background thread performing the equivalent of 'virsh list' in a periodic loop to ensure lifecycle events come in.
Could you please elaborate what's meant by "domain api being invoked"? Also could you please elaborate how do you receive the events?
Do you run a separate thread with an event loop? It is mandatory for delivering events.
The event loop runs in a separate thread. In this case it's actually a goroutine since we're using the libvirt-go bindings.

For the "domain api being invoked" question, I see the way I worded that is unclear. What I mean is that I've observed a correlation that indicates lifecycle events being fired depends on whether or not other seemingly unrelated functions are called.

For example. If I'm listening on a target node for a migration to arrive, I never receive the lifecycle event indicating the domain started on the target... However if I call a function that lists all the domains and retrieves the found domain's xml in a background thread while waiting for the migration to arrive in another thread, then I do get the lifecycle event.

"Also could you please elaborate how do you receive the events?"

We're registering a callback using the libvirt-go bindings.

Here's the binding.
https://github.com/libvirt/libvirt-go/blob/master/domain_events.go#L961

Here's the libvirt function that binding actually invokes.
https://libvirt.org/html/libvirt-libvirt-domain.html#virConnectDomainEventRegisterAny

The logic on our side is essentially just this.

-------------------
libvirt.EventRegisterDefaultImpl()
go func() {
    for {
        if res := libvirt.EventRunDefaultImpl(); res != nil {
            // failed listening to events, retry
            time.Sleep(time.Second)
        }
    }
}()

entrypointCallback := func(c *libvirt.Connect, d *libvirt.Domain, event *libvirt.DomainEventLifecycle) {
    fmt.Printf("yay got an event %v", event)
}
domainConn.DomainEventLifecycleRegister(entrypointCallback)
---------------------

We never see the "yay got an event" log message for a migrated domain. We do however receive the log message if I add another goroutine that sits in a loop listing all known domains and retrieving their domain xml. So, using that same domainConn object if I add something like this in the background then the lifecycle events work.
--------------------------
go func() {
    for {
        doms, _ := domainConn.ListAllDomains(libvirt.CONNECT_LIST_DOMAINS_ACTIVE | libvirt.CONNECT_LIST_DOMAINS_INACTIVE)
        for _, dom := range doms {
            dom.GetXMLDesc(libvirt.DOMAIN_XML_MIGRATABLE)
            dom.Free()
        }
        time.Sleep(time.Second*5)
    }
}()
--------------------------

Yes, I know that sounds crazy.
(In reply to David Vossel from comment #3)

[...]

> The logic on our side is essentially just this.
>
> -------------------
> libvirt.EventRegisterDefaultImpl()
> go func() {
>     for {
>         if res := libvirt.EventRunDefaultImpl(); res != nil {
>             // failed listening to events, retry
>             time.Sleep(time.Second)

I presume the timeout is 1 second here. Libvirt's event loop is meant to be run without a timeout since it blocks until events to process arrive.

>         }
>     }
> }()
>

[...]

> --------------------------
> domainConn.ListAllDomains(libvirt.CONNECT_LIST_DOMAINS_ACTIVE |
> libvirt.CONNECT_LIST_DOMAINS_INACTIVE)
>         for _, dom := range doms {
>             dom.GetXMLDesc(libvirt.DOMAIN_XML_MIGRATABLE)

Which would explain why this fixes it, since an API call processes all pending requests _without_ a timeout until the response for that call is received.

>             dom.Free()
>         }
>         time.Sleep(time.Second*5)
>     }
> }()
> --------------------------
>
> Yes, I know that sounds crazy.

Well, I think the sleep in the event loop causes it to be saturated by keepalive requests, so it can't get to process your events until you invoke an API, which processes all incoming data.

If removing the timeout does not help please try the following:

Could you please also retry your scenario while waiting for events using virsh:

virsh event --loop --all --timestamp
(In reply to Peter Krempa from comment #4)
> (In reply to David Vossel from comment #3)
>
> [...]
>
> > The logic on our side is essentially just this.
> >
> > -------------------
> > libvirt.EventRegisterDefaultImpl()
> > go func() {
> >     for {
> >         if res := libvirt.EventRunDefaultImpl(); res != nil {
> >             // failed listening to events, retry
> >             time.Sleep(time.Second)
>
> I presume the timeout is 1 second here. Libvirt's eventloop is meant to be
> run without a timeout since it blocks until events to process arrive.

Sorry I've misread the code. Well it indeed should run without timeout here ... please try the virsh event listener to see whether that's a go-specific problem:

> Could you please also retry your scenario while waiting for events using
> virsh:
>
> virsh event --loop --all --timestamp
(In reply to David Vossel from comment #3)
> The event loop runs in a separate thread. In this case it's actually a
> goroutine since we're using the libvirt-go bindings.
>
> For the "domain api being invoked" question, I see the way I worded that is
> unclear. What I mean is that I've observed a correlation that indicates
> lifecycle events being fired depends on whether or not other seemingly
> unrelated functions are called.
>
> For example. If I'm listening on a target node for a migration to arrive, I
> never receive the lifecycle event indicating the domain started on the
> target... However if I call a function that lists all the domains and
> retrieves the found domain's xml in a background thread while waiting for
> the migration to arrive in another thread, then I do get the lifecycle event.
>
> "Also could you please elaborate how do you receive the events?"
>
> We're registering a callback using the libvirt-go bindings.
>
> Here's the binding.
> https://github.com/libvirt/libvirt-go/blob/master/domain_events.go#L961
>
> Here's the libvirt function that binding actually invokes.
> https://libvirt.org/html/libvirt-libvirt-domain.html#virConnectDomainEventRegisterAny
>
>
> The logic on our side is essentially just this.
>
> -------------------
> libvirt.EventRegisterDefaultImpl()
> go func() {
>     for {
>         if res := libvirt.EventRunDefaultImpl(); res != nil {
>             // failed listening to events, retry
>             time.Sleep(time.Second)
>         }
>     }
> }()

IIUC, that is from this code:

https://github.com/kubevirt/kubevirt/blob/master/cmd/virt-launcher/virt-launcher.go#L125

which is called from

https://github.com/kubevirt/kubevirt/blob/master/cmd/virt-launcher/virt-launcher.go#L400

This, however, is *after* you have already opened the libvirt connection

https://github.com/kubevirt/kubevirt/blob/master/cmd/virt-launcher/virt-launcher.go#L366

The event impl *must* be registered before any connection is opened:

https://libvirt.org/html/libvirt-libvirt-event.html#virEventRegisterImpl

As a result the event subsystem won't be available when the remote connection is opened, and thus it will not be able to register a callback to receive async events out of band. Consequently, events will only be delivered at the next synchronous API call you make. This is why you only see the events when you run an API like listing domains.
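
To make the ordering issue concrete, here is a toy model in Go that uses only the standard library. It is not libvirt's implementation and does not call libvirt-go. The names `registerEventImpl`, `openConn`, `incoming` and `listDomains` are hypothetical stand-ins for `libvirt.EventRegisterDefaultImpl()`, opening the connection, the daemon sending an event, and any synchronous API call. The only behaviour it encodes is the one described above: a connection opened before the event impl is registered cannot receive events out of band, so they sit pending until the next synchronous call.

```go
package main

import "fmt"

// eventImplRegistered models the process-wide event loop registration.
var eventImplRegistered bool

// registerEventImpl is a hypothetical stand-in for libvirt.EventRegisterDefaultImpl().
func registerEventImpl() { eventImplRegistered = true }

// conn is a toy stand-in for a remote connection.
type conn struct {
	async    bool            // true only if an event impl existed when the connection was opened
	pending  []string        // events received from the daemon but not yet dispatched
	callback func(ev string) // registered lifecycle callback
}

// openConn captures, at open time, whether async delivery is possible.
func openConn() *conn { return &conn{async: eventImplRegistered} }

// registerLifecycle stores the callback that receives lifecycle events.
func (c *conn) registerLifecycle(cb func(string)) { c.callback = cb }

// incoming simulates the daemon sending an event over the connection.
func (c *conn) incoming(ev string) {
	c.pending = append(c.pending, ev)
	if c.async {
		c.dispatch() // out-of-band delivery via the event loop
	}
}

// listDomains models any synchronous API call: it drains pending data as a side effect.
func (c *conn) listDomains() { c.dispatch() }

// dispatch hands every pending event to the callback and clears the queue.
func (c *conn) dispatch() {
	for _, ev := range c.pending {
		if c.callback != nil {
			c.callback(ev)
		}
	}
	c.pending = nil
}

// run returns how many events were delivered before and after a synchronous API call.
func run(registerFirst bool) (beforeCall, afterCall int) {
	eventImplRegistered = false
	if registerFirst {
		registerEventImpl()
	}
	c := openConn()
	if !registerFirst {
		registerEventImpl() // too late: the connection is already open
	}
	got := 0
	c.registerLifecycle(func(string) { got++ })
	c.incoming("domain started (migrated)")
	beforeCall = got
	c.listDomains()
	afterCall = got
	return beforeCall, afterCall
}

func main() {
	b, a := run(false)
	fmt.Printf("registered after open:  %d before API call, %d after\n", b, a)
	b, a = run(true)
	fmt.Printf("registered before open: %d before API call, %d after\n", b, a)
}
```

In this model the late-registration case delivers nothing until `listDomains` runs, which matches the reported symptom, while registering first delivers the event immediately. In the real code the corresponding change would be to call `libvirt.EventRegisterDefaultImpl()` before the connection is opened.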
Daniel, it sounds like you nailed it. I'll give that a shot.