Bug 666158

Summary: domain suspension followed by two conflicting events
Product: Red Hat Enterprise Linux 6 Reporter: Dan Kenigsberg <danken>
Component: libvirt Assignee: Jiri Denemark <jdenemar>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: high Docs Contact:
Priority: high    
Version: 6.0 CC: abaron, berrange, dallan, dyuan, eblake, mjenner, nzhang, vbian, xen-maint, yoyzhang
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: libvirt-0.8.7-3.el6 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-05-19 13:25:21 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 624252, 672492, 672725, 678524    
Bug Blocks: 682015    
Attachments:
Description: some context to the said event; Flags: none
Description: slightly modified libvirtev.py to print all events; Flags: none

Description Dan Kenigsberg 2010-12-29 08:51:09 UTC
Description of problem:
Once in a while, suspending a domain ends with 
VIR_DOMAIN_EVENT_STOPPED + VIR_DOMAIN_EVENT_STOPPED_SAVED
followed by
VIR_DOMAIN_EVENT_STOPPED + VIR_DOMAIN_EVENT_STOPPED_FAILED

Version-Release number of selected component (if applicable):


How reproducible:
seldom

Additional info:
10:42:49.999: debug : remoteRelayDomainEventLifecycle:118 : Relaying domain lifecycle event 5 4
10:42:49.999: debug : virDomainFree:2218 : domain=0x7f3448048850
10:42:49.999: debug : remoteRelayDomainEventLifecycle:118 : Relaying domain lifecycle event 5 5
10:42:49.999: debug : virDomainFree:2218 : domain=0x7f3448048850
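
For readers matching the log lines above to the event names in the description, here is a minimal sketch that decodes the two numbers printed by remoteRelayDomainEventLifecycle. The numeric values follow libvirt's virDomainEventType and virDomainEventStoppedDetailType enums; the helper name decode_lifecycle is hypothetical and is not part of libvirt.

```python
# Lifecycle event types, numbered as in libvirt's virDomainEventType enum.
EVENT_TYPES = {
    0: "VIR_DOMAIN_EVENT_DEFINED",
    1: "VIR_DOMAIN_EVENT_UNDEFINED",
    2: "VIR_DOMAIN_EVENT_STARTED",
    3: "VIR_DOMAIN_EVENT_SUSPENDED",
    4: "VIR_DOMAIN_EVENT_RESUMED",
    5: "VIR_DOMAIN_EVENT_STOPPED",
}

# Detail codes for STOPPED, as in virDomainEventStoppedDetailType.
STOPPED_DETAILS = {
    0: "VIR_DOMAIN_EVENT_STOPPED_SHUTDOWN",
    1: "VIR_DOMAIN_EVENT_STOPPED_DESTROYED",
    2: "VIR_DOMAIN_EVENT_STOPPED_CRASHED",
    3: "VIR_DOMAIN_EVENT_STOPPED_MIGRATED",
    4: "VIR_DOMAIN_EVENT_STOPPED_SAVED",
    5: "VIR_DOMAIN_EVENT_STOPPED_FAILED",
}


def decode_lifecycle(event, detail):
    """Hypothetical helper: turn an 'event detail' pair from the log into names."""
    name = EVENT_TYPES.get(event, "unknown event %d" % event)
    if event == 5:
        detail_name = STOPPED_DETAILS.get(detail, "unknown detail %d" % detail)
        return "%s + %s" % (name, detail_name)
    # Detail tables for the other event types are left out of this sketch.
    return "%s (detail %d)" % (name, detail)


for pair in [(5, 4), (5, 5)]:
    print(pair, "->", decode_lifecycle(*pair))
```

Read this way, "event 5 4" is the expected SAVED stop and "event 5 5" is the conflicting FAILED stop.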

Comment 1 Dan Kenigsberg 2010-12-29 08:53:56 UTC
Created attachment 471045 [details]
some context to the said event

Comment 2 Daniel Berrangé 2011-01-05 10:59:00 UTC
The first SAVED event is emitted in qemudDomainSaveFlag(). I think that while qemudDomainSaveFlag() is running and holds the lock on the domain object, a monitor EOF event arrives, which then causes qemuHandleMonitorEOF() to run and emit a FAILED event. qemuHandleMonitorEOF() probably needs to do an if (virDomainIsActive()) check before emitting the event.
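
The ordering described in this comment can be illustrated with a toy model. This is only a sketch of the hypothesis, not libvirt code: ToyDomain, save_domain, handle_monitor_eof and run are hypothetical names, the lock is not modelled (the EOF handler simply runs after the save path has finished), and the actual upstream patch may take a different approach.

```python
class ToyDomain:
    """Toy stand-in for a domain object; not libvirt's real data structure."""

    def __init__(self):
        self.active = True
        self.events = []


def save_domain(dom):
    # Models the save path: the guest is stopped and a SAVED event is emitted.
    dom.active = False
    dom.events.append(("STOPPED", "SAVED"))


def handle_monitor_eof(dom, check_active):
    # Models the EOF handler that was waiting while the save path ran.
    if check_active and not dom.active:
        return  # the domain is already stopped, so there is nothing to report
    dom.active = False
    dom.events.append(("STOPPED", "FAILED"))


def run(check_active):
    dom = ToyDomain()
    save_domain(dom)
    handle_monitor_eof(dom, check_active)
    return dom.events


print("without check:", run(check_active=False))
print("with check:   ", run(check_active=True))
```

Without the check the model emits SAVED followed by FAILED, which is the pair reported in the description; with the check only SAVED is emitted.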

Comment 3 Dave Allan 2011-01-05 21:29:35 UTC
Dan, you don't sound entirely convinced of your analysis.  How do you want to proceed?

Comment 4 Dave Allan 2011-01-05 21:30:27 UTC
(In reply to comment #3)
> Dan, you don't sound entirely convinced of your analysis.  How do you want to
> proceed?

Just to clarify, that's Dan B. I'm asking.

Comment 5 Daniel Berrangé 2011-01-06 10:48:29 UTC
This is simply a hypothesis based on looking at the code and DanK's logfile. It would of course need investigation and testing to see if it's correct.

Comment 6 Dave Allan 2011-01-06 19:39:49 UTC
Dan K, is it something that happens at a constant rate; i.e., if you suspend and resume a domain in a loop do you expect to see the incorrect event periodically?  I'm trying to figure out how we can test Dan B's hypothesis.

Comment 7 Dan Kenigsberg 2011-01-06 21:10:43 UTC
(In reply to comment #6)
> Dan K, is it something that happens at a constant rate; i.e., if you suspend
> and resume a domain in a loop do you expect to see the incorrect event
> periodically?  I'm trying to figure out how we can test Dan B's hypothesis.

Yes, at least in vdsm environment, Igor experienced it quite often, running http://git.engineering.redhat.com/?p=users/dkenigsb/vdsm.git;a=blob;f=vdsm/storage/ut/multiVmTests.py

Comment 8 Jiri Denemark 2011-01-19 19:31:20 UTC
I think the hypothesis in comment 2 is right. By putting sleep() inside qemudShutdownVMDaemon(), I was able to get two STOPPED events in a row for a single domain. The patch that fixes this was sent upstream for review: https://www.redhat.com/archives/libvir-list/2011-January/msg00818.html

Comment 9 Jiri Denemark 2011-01-19 19:57:32 UTC
The patch is now upstream and its backport has been sent for internal review.

Comment 11 Vivian Bian 2011-01-21 12:25:40 UTC
Hi Jiri,
I have not found a way to reproduce this bug with the old version of libvirt. Could you please suggest how to reproduce it?

Thanks
Vivian

Comment 12 Dave Allan 2011-01-21 18:22:25 UTC
Vivian, please see comment #7 for a reproducer script.  You need to read the full transcript before asking for help reproducing a bug.  Often the answer is there.

Comment 13 Vivian Bian 2011-01-25 08:50:17 UTC
Hi Dan,
Would you please help verify this bug, for the following reasons:
1. We don't have the RUTH environment in China.
2. I failed to reproduce this bug without the RUTH environment.
3. The reproducer script in comment #7 is from RUTH. I managed to make the script run without errors, but it did not produce any output either.

Thanks 
Vivian

Comment 14 Dan Kenigsberg 2011-01-26 09:56:45 UTC
(In reply to comment #13)

Please run the attached script. It registers to libvirt and prints received events.

python libvirtev.py | tee /tmp/log

In another console run save/restore in a tight loop. Verify that "save" is followed by only one Stop event.
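
The check on /tmp/log can be automated along these lines. This is a sketch under assumptions: the line format is taken from the sample libvirtev.py output quoted in comment 19 ("... EVENT: Domain new(-1) Stopped 4"), the attached script may print events differently, and find_repeated_stops is a hypothetical helper name.

```python
import re

# Assumed line format, based on the sample output quoted in comment 19:
#   myDomainEventCallback2 EVENT: Domain new(-1) Stopped 4
EVENT_RE = re.compile(r"EVENT: Domain (\S+?)\((-?\d+)\) (\w+) (\d+)")


def find_repeated_stops(lines):
    """Return (line_number, domain_name) for every Stopped event that directly
    follows another Stopped event of the same domain."""
    last_event = {}
    repeated = []
    for number, line in enumerate(lines, start=1):
        match = EVENT_RE.search(line)
        if match is None:
            continue  # not an event line
        name, _dom_id, event, _detail = match.groups()
        if event == "Stopped" and last_event.get(name) == "Stopped":
            repeated.append((number, name))
        last_event[name] = event
    return repeated


sample = [
    "myDomainEventCallback2 EVENT: Domain new(-1) Stopped 4",
    "myDomainEventCallback2 EVENT: Domain new(-1) Stopped 5",
    "myDomainEventCallback2 EVENT: Domain new(44) Started 2",
]
print(find_repeated_stops(sample))
```

To check a real run, pass the open log file instead of the sample list, for example find_repeated_stops(open("/tmp/log")). An empty result means every save was followed by a single Stopped event.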

Comment 15 Dan Kenigsberg 2011-01-26 09:58:04 UTC
Created attachment 475358 [details]
slightly modified libvirtev.py to print all events

Comment 16 Vivian Bian 2011-02-15 08:05:37 UTC
I can't reproduce this bug using the suggestions in comment 14 and comment 15, but hit bug 672725 (the blocker) instead. I will retest this bug after the blocker bug is fixed.

I also tried the script from comment 7 and could not reproduce this bug in the RUTH environment with the old version. Is any special profile required for the reproducer machine?

Dan, would you please give me more suggestions on this bug? Thanks!

Comment 18 Vivian Bian 2011-03-10 10:53:41 UTC
Adding blocker bugs 624252 and 678524. Unless 678524 is fixed, the blocker bug 672725 cannot be fully cleared of error reports such as
error: Failed to restore domain from /opt/test.save
error: operation failed: failed to read qemu header
Unless 624252 is fixed, this bug cannot be verified on the vdsm RUTH system.

Comment 19 zhanghaiyan 2011-03-16 08:09:05 UTC
Cannot reproduce this bug with either the older libvirt-0.8.1-27.el6.x86_64.rpm or the new libvirt-0.8.7-12.el6.x86_64.

In a plain libvirt environment, I executed the reproducer from comment 14 and saved/restored guests 200 times. No save was followed by a second Stopped event.

# python libvirtev.py
Using uri:qemu:///system
.....
myDomainEventCallback2 EVENT: Domain new(-1) Stopped 4
myDomainEventCallback2 EVENT: Domain new(44) Started 2
myDomainEventCallback2 EVENT: Domain new(-1) Stopped 4
myDomainEventCallback2 EVENT: Domain new(45) Started 2
myDomainEventCallback2 EVENT: Domain new(-1) Stopped 4
myDomainEventCallback2 EVENT: Domain new(46) Started 2
myDomainEventCallback2 EVENT: Domain new(-1) Stopped 4
.....

Comment 25 errata-xmlrpc 2011-05-19 13:25:21 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0596.html