Bug 1335383
| Summary: | cannot resume a vm that went to paused state after killing gluster fuse mount process | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | RamaKasturi <knarra> |
| Component: | core | Assignee: | Ravishankar N <ravishankar> |
| Status: | CLOSED NOTABUG | QA Contact: | Anoop <annair> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | rhgs-3.1 | CC: | knarra, pkarampu, rgowdapp, rhinduja, rhs-bugs, sasundar, storage-qa-internal |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2016-05-25 04:10:50 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1258386 | ||
Description
RamaKasturi
2016-05-12 07:01:27 UTC
The VM I am trying to resume is running on zod.lab.eng.blr.redhat.com, and the VM name is BootStrom_windows_vm-6.

Log snippet from engine logs:
================================
2016-05-12 11:42:04,987 INFO [org.ovirt.engine.core.bll.RunVmCommand] (ajp-/127.0.0.1:8702-3) [296ab749] Lock Acquired to object 'EngineLock:{exclusiveLocks='[319340d7-690d-42c5-b583-809cfa03e82e=<VM, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2016-05-12 11:42:05,161 INFO [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-6-thread-2) [296ab749] Running command: RunVmCommand internal: false. Entities affected : ID: 319340d7-690d-42c5-b583-809cfa03e82e Type: VMAction group RUN_VM with role type USER
2016-05-12 11:42:05,168 INFO [org.ovirt.engine.core.vdsbroker.ResumeVDSCommand] (org.ovirt.thread.pool-6-thread-2) [296ab749] START, ResumeVDSCommand( ResumeVDSCommandParameters:{runAsync='true', hostId='c7356010-a54c-4848-91c1-6e861dcea129', vmId='319340d7-690d-42c5-b583-809cfa03e82e'}), log id: 2c2a1fea
2016-05-12 11:42:05,170 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ResumeBrokerVDSCommand] (org.ovirt.thread.pool-6-thread-2) [296ab749] START, ResumeBrokerVDSCommand(HostName = hosted_engine_3, ResumeVDSCommandParameters:{runAsync='true', hostId='c7356010-a54c-4848-91c1-6e861dcea129', vmId='319340d7-690d-42c5-b583-809cfa03e82e'}), log id: 2d673cd2
2016-05-12 11:42:05,961 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ResumeBrokerVDSCommand] (org.ovirt.thread.pool-6-thread-2) [296ab749] FINISH, ResumeBrokerVDSCommand, log id: 2d673cd2
2016-05-12 11:42:05,961 INFO [org.ovirt.engine.core.vdsbroker.ResumeVDSCommand] (org.ovirt.thread.pool-6-thread-2) [296ab749] FINISH, ResumeVDSCommand, return: PoweringUp, log id: 2c2a1fea
2016-05-12 11:42:05,962 INFO [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-6-thread-2) [296ab749] Lock freed to object 'EngineLock:{exclusiveLocks='[319340d7-690d-42c5-b583-809cfa03e82e=<VM, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2016-05-12 11:42:05,978 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-2) [296ab749] Correlation ID: 296ab749, Job ID: e1d832e3-96ad-48c4-b3a2-bfa7ee4c9624, Call Stack: null, Custom Event ID: -1, Message: VM BootStrom_windows_vm-6 was resumed by admin@internal (Host: hosted_engine_3).
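All of the engine log lines above belong to one flow, tied together by the bracketed correlation ID ([296ab749] here). A minimal sketch (not part of any oVirt tooling; the helper name is made up) for pulling one flow's lines out of engine.log:

```python
import re

# Engine log lines carry a bracketed 8-hex-digit correlation ID
# (e.g. "[296ab749]" in the snippet above) that ties together all
# lines produced by one user action.
CORR_ID = re.compile(r"\[([0-9a-f]{8})\]")

def lines_for_flow(log_lines, corr_id):
    """Return only the log lines whose correlation ID equals corr_id."""
    matched = []
    for line in log_lines:
        m = CORR_ID.search(line)
        if m and m.group(1) == corr_id:
            matched.append(line)
    return matched
```

Running this over the snippet with corr_id="296ab749" would return every line shown, since the whole resume flow shares that ID.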
I tried the same test with libvirt + qemu-kvm + glusterfs-fuse, excluding RHEV.

Versions:
glusterfs-3.7.9-4.el7rhgs
RHEV 3.6.5
RHEL 7.2

Steps:
1. Fuse mounted the sharded replica 3 gluster volume
2. Created the VM image file
3. Installed the VM with RHEL 6.5 and booted it
4. When the VM was up and running, killed the gluster mount process ( pkill glusterfs )

Observations:
1. The VM went into paused state
2. When the volume was mounted back, the VM remained in paused state
3. Manually resuming the VM also doesn't work ( # virsh resume vm1 )
4. Killing the VM and starting it again helped.

Logs ( /var/log/libvirt/qemu/vm1.log )
--------------------------------------
<snip>
2016-05-12 07:43:42.203+0000: starting up libvirt version: 1.2.17, package: 13.el7_2.4 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2016-03-02-11:10:27, x86-034.build.eng.bos.redhat.com), qemu version: 1.5.3 (qemu-kvm-1.5.3-105.el7_2.4)
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name vm1 -S -machine pc-i440fx-rhel7.0.0,accel=kvm,usb=off -cpu SandyBridge -m 4096 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 90d2e762-04d9-4f5e-b001-152d71cce31e -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-vm1/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x5.0x7 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x5 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x5.0x1 -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x5.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive file=/home/vmstore/vm1.img,if=none,id=drive-virtio-disk0,format=raw,cache=none,werror=stop,rerror=stop -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive if=none,id=drive-ide0-0-0,readonly=on,format=raw -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,fd=24,id=hostnet0,vhost=on,vhostfd=25 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:29:30:8d,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-vm1/org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -device usb-tablet,id=input0 -vnc 127.0.0.1:0 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on
char device redirected to /dev/pts/2 (label charserial0)
block I/O error in device 'drive-virtio-disk0': Transport endpoint is not connected (107)
[the line above is repeated many times]
qemu: terminating on signal 15 from pid 12218
</snip>

(In reply to SATHEESARAN from comment #3)
> I tried the same test with libvirt + qemu-kvm + glusterfs-fuse, excluding
> RHEV.
>
> Versions
> glusterfs-3.7.9-4.el7rhgs
> RHEV 3.6.5
> RHEL 7.2

Mistakenly mentioned the RHEV version; there is no RHEV in this test.

Adding the qemu and libvirt versions:
libvirt-1.2.17-13.el7_2.4.x86_64
qemu-kvm-common-1.5.3-105.el7_2.4.x86_64
qemu-kvm-1.5.3-105.el7_2.4.x86_64

sos reports can be found in the link below:
==================================================
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1335383/

Pranith - can you check if this is related to Bug 1330044? Here too, one of the brick processes is killed.

(In reply to Sahina Bose from comment #6)
> Pranith - can you check if this is related to Bug 1330044? Here too, one of
> brick processes is killed

For what it's worth - in this case the mount process is killed and then started again by re-mounting.

Nope, this is not because of either EIO/EINVAL. It seems to be because of ENOTCONN.

Hi Ravi,
Following is what I did to verify the behaviour on a native XFS mount. I wrote the small Python script below and ran ./godown /mnt/fio_test to shut down XFS. The script failed with an input/output error, and once the filesystem was remounted the script did not continue writing to the file.
f = open('/mnt/fio_test/test.txt', 'a')
x = 1
while True:
    f.write("To infinity and beyond! We're getting close, on %d now!" % (x))
    x += 1
Thanks
kasturi
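For anyone re-running this check, here is a hedged variant of the loop above (file path and iteration bound are arbitrary) that forces each write through to the mount and reports the errno that finally stops it. On a fuse mount whose glusterfs process has died this would typically be ENOTCONN (107), the same "Transport endpoint is not connected" error qemu logged; a plain buffered write can mask the error until the stdio buffer flushes.

```python
import errno
import os

def write_until_error(path, max_iters=1000):
    """Append lines to path, pushing every write through the kernel.

    Returns None if all writes succeed, or the symbolic errno name
    (e.g. "ENOTCONN", "EIO") of the OSError that stopped the loop.
    """
    try:
        with open(path, "a") as f:
            for x in range(max_iters):
                f.write("To infinity and beyond! On %d now!\n" % x)
                f.flush()             # stdio buffer -> kernel
                os.fsync(f.fileno())  # kernel -> the underlying mount
        return None
    except OSError as e:
        return errno.errorcode.get(e.errno, str(e.errno))
```

On a healthy filesystem write_until_error('/tmp/test.txt') returns None; the interesting case is pointing it at a fuse mount and killing the glusterfs process mid-loop.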
Thanks for the confirmation, Kasturi. Closing the BZ as this appears to be expected behaviour even on on-disk file systems, based on comment #10.