Bug 1031943

Summary: QEMU crashes after resume (cont) with gluster backed volumes
Product: Red Hat Enterprise Linux 7
Reporter: Shanzhi Yu <shyu>
Component: qemu-kvm
Assignee: Jeff Cody <jcody>
Status: CLOSED WORKSFORME
QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium
Docs Contact:
Priority: medium
Version: 7.0
CC: dyuan, hhuang, juzhang, mazhang, mzhan, pkrempa, rbalakri, shyu, virt-bugs, virt-maint, xuhan
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-03-03 20:20:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments:
- libvirtd log (flags: none)
- vm log (flags: none)
- libvirtd.log (flags: none)
- guest log (flags: none)

Description Shanzhi Yu 2013-11-19 08:09:27 UTC
Description of problem:

Failed to resume a guest that uses a glusterfs volume.

Version-Release number of selected component (if applicable):

qemu-kvm-rhev-1.5.3-19.el7.x86_64
libvirt-1.1.1-12.el7.x86_64

How reproducible:

100%

Steps to Reproduce:

1. Create a guest using a glusterfs volume as the source disk
# virsh dumpxml rhel6
..
<disk type='network' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source protocol='gluster' name='gluster-vol1/rhel6-qcow2.img'>
        <host name='10.66.106.22' port='24007' transport='rdma'/>
      </source>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </disk>
    <controller type='scsi' index='0' model='virtio-scsi'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </controller>

..

2. Suspend the guest

# virsh start rhel6 ;virsh suspend rhel6;virsh list --all
Domain rhel6 started

Domain rhel6 suspended

 Id    Name                           State
----------------------------------------------------
 72    rhel6                          paused


3. Resume the paused guest and check the guest status

# virsh resume rhel6;virsh list --all

Domain rhel6 resumed

 Id    Name                           State
----------------------------------------------------
      rhel6                          shut off

Notes:

1. If I modify "<target dev='vda' bus='virtio'/>" to
"<target dev='sda' bus='scsi'/>" or
delete the lines "<controller type='scsi' index='0' model='virtio-scsi'>
..
</controller>" in the guest XML,

then when I retest, there is no problem; the guest resumes successfully.

2. If I use a file-type disk as the source, it works well in all the situations above.
 



Actual results:

as above

Expected results:

The guest should resume successfully in step 3.

Additional info:
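
For reference, the reproduction steps above can be collected into a single script (a sketch only: it assumes a guest named "rhel6" is already defined with the gluster-backed virtio disk and virtio-scsi controller shown in the XML above, and the expected outputs in the comments reflect what this report observed, not guaranteed behavior):

```shell
#!/bin/sh
# Reproduction sketch for this report; requires a libvirt host with the
# domain "rhel6" defined against a gluster-backed qcow2 volume.
set -e

virsh start rhel6      # Domain rhel6 started
virsh suspend rhel6    # Domain rhel6 suspended
virsh list --all       # rhel6 should show as "paused"
virsh resume rhel6     # reported trigger: qemu crashes on resume
sleep 2
virsh list --all       # bug symptom: rhel6 shows "shut off" instead of "running"
```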

Comment 2 Peter Krempa 2013-11-19 09:54:37 UTC
Does the same happen if you don't use RDMA transport?

Do you have your InfiniBand connection properly configured?

Please provide debug logs of the libvirt daemon AND the vm log of the VM that crashed/disappeared. ( /var/log/libvirt/qemu/rhel6.log )
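
The debug logs requested here can typically be enabled in /etc/libvirt/libvirtd.conf before restarting libvirtd (a minimal sketch; the filter list and log path are assumptions for this setup, not from the report):

```
# /etc/libvirt/libvirtd.conf -- debug logging sketch
log_level = 1
log_filters = "1:qemu 1:libvirt"
log_outputs = "1:file:/var/log/libvirt/libvirtd.log"
```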

Comment 3 Shanzhi Yu 2013-11-19 10:38:36 UTC
(In reply to Peter Krempa from comment #2)
> Does the same happen if you don't use RDMA transport?
> 

yes

> Do you have your InfiniBand connection properly configured?
> 

What does this mean? I can use glusterfs without problems otherwise.

> Please provide debug logs of the libvirt daemon AND the vm log of the VM
> that crashed/disappeared. ( /var/log/libvirt/qemu/rhel6.log )

The libvirtd log and vm log are attached.

Comment 4 Shanzhi Yu 2013-11-19 10:40:06 UTC
Created attachment 826007 [details]
libvirtd log

Comment 5 Shanzhi Yu 2013-11-19 10:41:13 UTC
Created attachment 826008 [details]
vm log

Comment 8 Shanzhi Yu 2013-11-19 15:16:21 UTC
Created attachment 826105 [details]
libvirtd.log

Comment 9 Shanzhi Yu 2013-11-19 15:17:12 UTC
Created attachment 826107 [details]
guest log

Comment 10 Peter Krempa 2013-11-22 14:05:30 UTC
According to the libvirtd log qemu crashes:

2013-11-19 15:02:46.129+0000: 19917: debug : qemuMonitorIO:708 : Error on monitor Unable to read from monitor: Connection reset by peer
2013-11-19 15:02:46.129+0000: 19917: debug : virEventPollUpdateHandle:147 : EVENT_POLL_UPDATE_HANDLE: watch=14 events=12
2013-11-19 15:02:46.129+0000: 19917: debug : virEventPollInterruptLocked:710 : Skip interrupt, 1 139653503248512
2013-11-19 15:02:46.129+0000: 19917: debug : virObjectUnref:256 : OBJECT_UNREF: obj=0x7f037c007120
2013-11-19 15:02:46.129+0000: 19917: debug : qemuMonitorIO:731 : Triggering EOF callback
2013-11-19 15:02:46.137+0000: 19917: debug : qemuProcessHandleMonitorEOF:293 : Received EOF on 0x7f037400e870 'rhel6.4'
2013-11-19 15:02:46.137+0000: 19917: debug : qemuProcessHandleMonitorEOF:311 : Monitor connection to 'rhel6.4' closed without SHUTDOWN event; assuming the domain crashed
2013-11-19 15:02:46.137+0000: 19917: debug : virObjectRef:293 : OBJECT_REF: obj=0x7f03801509d0
2013-11-19 15:02:46.137+0000: 19917: debug : qemuProcessStop:4140 : Shutting down VM 'rhel6.4' pid=20171 flags=0

I'm going to re-assign this to the qemu component for further investigation.

I wasn't able to reproduce the issue in my environment, thus I can't provide any additional information. Please attach a stack trace of the crashed qemu to aid the qemu developers in finding the issue.
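
One way to capture such a stack trace (a sketch, assuming core dumps are enabled on the host and that /usr/libexec/qemu-kvm and the core file path match your install):

```shell
# Allow core dumps in the shell that launches the guest (sketch; the
# qemu binary path /usr/libexec/qemu-kvm is an assumption for RHEL7).
ulimit -c unlimited

# After the crash, dump backtraces of all threads from the core file:
gdb /usr/libexec/qemu-kvm /path/to/core \
    -batch -ex 'thread apply all bt full' > qemu-backtrace.txt
```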

Comment 11 Peter Krempa 2013-11-22 14:18:08 UTC
*** Bug 1030749 has been marked as a duplicate of this bug. ***

Comment 12 Ademar Reis 2013-12-10 19:47:48 UTC
Looks like a dupe, or at least related, to bug 1031877

Comment 13 Jeff Cody 2014-01-28 18:35:52 UTC
I have been unable to reproduce this bug in my environment as well, both on a RHEL7 guest and on my normal F19 dev machine running RHEL7 qemu binaries. I've tried both the glusterfs library version for RHEL7 and the latest from git, and I still cannot reproduce.

Comment 14 Ademar Reis 2014-01-28 19:11:07 UTC
Both Peter and Jeff failed to reproduce it... Can you test once more and give us more details about your environment?

Comment 15 Shanzhi Yu 2014-02-10 11:38:13 UTC
(In reply to Jeff Cody from comment #13)
> I have been unable to reproduce this bug in my environment as well, both on
> a RHEL7 guest and my normal F19 dev machine running RHEL7 qemu binaries. 
> I've tried both the glusterfs lib version for RHEL7, as well as latest from
> git, and I still cannot reproduce.

Hi Jeff,
I can reproduce it with the latest qemu-kvm-rhev & libvirt.
Please note that I tested on a guest without a healthy OS;
I can't reproduce it with a healthy guest.

# rpm -q libvirt qemu-kvm-rhev glusterfs
libvirt-1.1.1-22.el7.x86_64
qemu-kvm-rhev-1.5.3-45.el7.x86_64
glusterfs-3.4.0.59rhs-1.el7.x86_64

1. Prepare a guest with a glusterfs volume as the source disk

# virsh dumpxml rhel6|grep disk -A 4
    <disk type='network' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source protocol='gluster' name='gluster-vol1/test.img'>
        <host name='10.66.5.78' port='24007'/>
      </source>
--
    </disk>
    <controller type='scsi' index='0' model='virtio-scsi'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </controller>
# qemu-img info gluster://10.66.5.78/gluster-vol1/test.img
image: gluster://10.66.5.78/gluster-vol1/test.img
file format: qcow2
virtual size: 100G (107374182400 bytes)
disk size: 194K
cluster_size: 65536
Format specific information:
    compat: 1.1
    lazy refcounts: false
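
The empty test image inspected above can be created directly over the gluster URI (a sketch; the host IP and volume name are taken from this report, and qemu-img built with gluster protocol support is assumed):

```shell
# Create the 100G qcow2 image on the gluster volume, then verify it;
# substitute your own host and volume names.
qemu-img create -f qcow2 gluster://10.66.5.78/gluster-vol1/test.img 100G
qemu-img info gluster://10.66.5.78/gluster-vol1/test.img
```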

2. start guest and suspend/resume it
# virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     rhel6                          shut off
# virsh start rhel6
Domain rhel6 started

# virsh list 
 Id    Name                           State
----------------------------------------------------
 55    rhel6                          running
# virsh suspend rhel6
Domain rhel6 suspended

# virsh list --all
 Id    Name                           State
----------------------------------------------------
 55    rhel6                          paused
# virsh list  --all
 Id    Name                           State
----------------------------------------------------
 -     rhel6                          shut off
3. Error info
# grep error /tmp/libvirtd.log 
2014-02-10 11:23:38.288+0000: 26424: error : qemuMonitorIORead:552 : Unable to read from monitor: Connection reset by peer

Comment 16 Ademar Reis 2014-04-18 15:05:19 UTC
*** Bug 1031877 has been marked as a duplicate of this bug. ***

Comment 18 Jeff Cody 2014-07-22 20:29:57 UTC
Shanzhi,

> Please note that I do test on guest without an healthy os,
> I can't reproduce it with an healthy guest.

What do you mean by "healthy" os?

Comment 19 Shanzhi Yu 2014-07-23 02:36:46 UTC
(In reply to Jeff Cody from comment #18)
> Shanzhi,
> 
> > Please note that I do test on guest without an healthy os,
> > I can't reproduce it with an healthy guest.
> 
> What do you mean by "healthy" os?

Install an OS (RHEL6.X) on the guest and make sure the guest is in running status.

Comment 20 Jeff Cody 2014-11-05 20:13:28 UTC
(In reply to Shanzhi Yu from comment #19)
> (In reply to Jeff Cody from comment #18)
> > Shanzhi,
> > 
> > > Please note that I do test on guest without an healthy os,
> > > I can't reproduce it with an healthy guest.
> > 
> > What do you mean by "healthy" os?
> 
> Install OS(RHEL6.X) on guest and make sure guest is running status

I'm still confused by this differentiation - are you able to reproduce this BZ still?  Can you give me more information by what you mean by healthy vs unhealthy guest?  Thanks!

Comment 21 Shanzhi Yu 2014-11-11 05:24:00 UTC
(In reply to Jeff Cody from comment #20)
> (In reply to Shanzhi Yu from comment #19)
> > (In reply to Jeff Cody from comment #18)
> > > Shanzhi,
> > > 
> > > > Please note that I do test on guest without an healthy os,
> > > > I can't reproduce it with an healthy guest.
> > > 
> > > What do you mean by "healthy" os?
> > 
> > Install OS(RHEL6.X) on guest and make sure guest is running status
> 
> I'm still confused by this differentiation - are you able to reproduce this
> BZ still?  Can you give me more information by what you mean by healthy vs
> unhealthy guest?  Thanks!

Hi Jeff,

Previously, I reproduced it when I tried to suspend/resume a guest without an OS installed (just define/start a guest with a clean source file).

Currently, I fail to reproduce it.
I am using the latest libvirt/qemu versions on RHEL7:

# rpm -q libvirt qemu-kvm-rhev glusterfs
libvirt-1.2.8-6.el7.x86_64
qemu-kvm-rhev-2.1.2-7.el7.x86_64
glusterfs-3.6.0.29-2.el7.x86_64

Comment 22 Jeff Cody 2015-03-03 20:20:57 UTC
Closing this, as we are unable to reproduce.