Description of problem:
I have some VMs running on a hyperconverged (HC) setup. On one of my nodes the glusterfsd, glusterfs and glusterd processes were killed, due to which the VMs went into a paused state. Once all the glusterfs, glusterd and glusterfsd processes are back up and running, I try to resume the VMs, and resuming does not work.

Version-Release number of selected component (if applicable):
glusterfs-3.7.9-3.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install an HC setup.
2. On one of the hypervisors, kill the glusterfsd, glusterd and glusterfs processes.
3. Bring all the processes back up and try to resume the VM from the UI.

Actual results:
The VM cannot be resumed.

Expected results:
The user should be able to resume the VM.

Additional info:
The VM which I am trying to resume is running on zod.lab.eng.blr.redhat.com and the VM name is BootStrom_windows_vm-6.

log snippet from engine logs:
================================
2016-05-12 11:42:04,987 INFO [org.ovirt.engine.core.bll.RunVmCommand] (ajp-/127.0.0.1:8702-3) [296ab749] Lock Acquired to object 'EngineLock:{exclusiveLocks='[319340d7-690d-42c5-b583-809cfa03e82e=<VM, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2016-05-12 11:42:05,161 INFO [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-6-thread-2) [296ab749] Running command: RunVmCommand internal: false. Entities affected : ID: 319340d7-690d-42c5-b583-809cfa03e82e Type: VMAction group RUN_VM with role type USER
2016-05-12 11:42:05,168 INFO [org.ovirt.engine.core.vdsbroker.ResumeVDSCommand] (org.ovirt.thread.pool-6-thread-2) [296ab749] START, ResumeVDSCommand( ResumeVDSCommandParameters:{runAsync='true', hostId='c7356010-a54c-4848-91c1-6e861dcea129', vmId='319340d7-690d-42c5-b583-809cfa03e82e'}), log id: 2c2a1fea
2016-05-12 11:42:05,170 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ResumeBrokerVDSCommand] (org.ovirt.thread.pool-6-thread-2) [296ab749] START, ResumeBrokerVDSCommand(HostName = hosted_engine_3, ResumeVDSCommandParameters:{runAsync='true', hostId='c7356010-a54c-4848-91c1-6e861dcea129', vmId='319340d7-690d-42c5-b583-809cfa03e82e'}), log id: 2d673cd2
2016-05-12 11:42:05,961 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ResumeBrokerVDSCommand] (org.ovirt.thread.pool-6-thread-2) [296ab749] FINISH, ResumeBrokerVDSCommand, log id: 2d673cd2
2016-05-12 11:42:05,961 INFO [org.ovirt.engine.core.vdsbroker.ResumeVDSCommand] (org.ovirt.thread.pool-6-thread-2) [296ab749] FINISH, ResumeVDSCommand, return: PoweringUp, log id: 2c2a1fea
2016-05-12 11:42:05,962 INFO [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-6-thread-2) [296ab749] Lock freed to object 'EngineLock:{exclusiveLocks='[319340d7-690d-42c5-b583-809cfa03e82e=<VM, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2016-05-12 11:42:05,978 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-2) [296ab749] Correlation ID: 296ab749, Job ID: e1d832e3-96ad-48c4-b3a2-bfa7ee4c9624, Call Stack: null, Custom Event ID: -1, Message: VM BootStrom_windows_vm-6 was resumed by admin@internal (Host: hosted_engine_3).
I tried the same test with libvirt + qemu-kvm + glusterfs-fuse, excluding RHEV.

Versions
glusterfs-3.7.9-4.el7rhgs
RHEV 3.6.5
RHEL 7.2

1. fuse mounted the sharded replica 3 gluster volume
2. created the VM image file
3. installed a VM with RHEL 6.5 and booted it
4. when the VM was up and running, killed the gluster mount process ( pkill glusterfs )

Observations:
1. The VM went into a paused state.
2. When the volume was mounted back, the VM continued in the paused state.
3. Manually resuming the VM does not work either ( # virsh resume vm1 ).
4. Killing the VM and starting it again helped.

Logs ( /var/log/libvirt/qemu/vm1.log )
--------------------------------------
<snip>
2016-05-12 07:43:42.203+0000: starting up libvirt version: 1.2.17, package: 13.el7_2.4 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2016-03-02-11:10:27, x86-034.build.eng.bos.redhat.com), qemu version: 1.5.3 (qemu-kvm-1.5.3-105.el7_2.4)
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name vm1 -S -machine pc-i440fx-rhel7.0.0,accel=kvm,usb=off -cpu SandyBridge -m 4096 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 90d2e762-04d9-4f5e-b001-152d71cce31e -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-vm1/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x5.0x7 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x5 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x5.0x1 -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x5.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive file=/home/vmstore/vm1.img,if=none,id=drive-virtio-disk0,format=raw,cache=none,werror=stop,rerror=stop -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive if=none,id=drive-ide0-0-0,readonly=on,format=raw -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,fd=24,id=hostnet0,vhost=on,vhostfd=25 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:29:30:8d,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-vm1/org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -device usb-tablet,id=input0 -vnc 127.0.0.1:0 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on
char device redirected to /dev/pts/2 (label charserial0)
block I/O error in device 'drive-virtio-disk0': Transport endpoint is not connected (107)
[the above block I/O error line appears 18 times in the log]
qemu: terminating on signal 15 from pid 12218
</snip>
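The manual reproduction above boils down to a short command sequence. Below is a minimal sketch of that sequence; the volume name node1:/vmvol is a hypothetical placeholder (the real volume name is not given in this report), while the mount point, domain name and commands come from the comment itself. The commands are only printed, not executed, so the sketch can be inspected safely:

```python
# Sketch of the fuse-mount kill test. node1:/vmvol is a hypothetical
# volume name; /home/vmstore and vm1 come from the qemu command line above.
# Commands are printed rather than run.
steps = [
    "pkill glusterfs",                                # kill the fuse mount; VM pauses on I/O error
    "mount -t glusterfs node1:/vmvol /home/vmstore",  # bring the mount back
    "virsh resume vm1",                               # attempted resume; VM stays paused
    "virsh destroy vm1",                              # only a full stop...
    "virsh start vm1",                                # ...and fresh start recovers the guest
]

for step in steps:
    print(step)
```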
(In reply to SATHEESARAN from comment #3)
> I tried the same test with libvirt + qemu-kvm + glusterfs-fuse, excluding
> RHEV.
>
> Versions
> glusterfs-3.7.9-4.el7rhgs
> RHEV 3.6.5
> RHEL 7.2

I mistakenly mentioned the RHEV version; there is no RHEV in this test.

Adding the qemu and libvirt versions:
libvirt-1.2.17-13.el7_2.4.x86_64
qemu-kvm-common-1.5.3-105.el7_2.4.x86_64
qemu-kvm-1.5.3-105.el7_2.4.x86_64
sosreports can be found at the link below:
==================================================
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1335383/
Pranith - can you check if this is related to Bug 1330044? Here too, one of the brick processes is killed.
(In reply to Sahina Bose from comment #6)
> Pranith - can you check if this is related to Bug 1330044? Here too, one of
> the brick processes is killed

For what it's worth - in this case it is the mount process that was killed and then brought back by mounting again.
Nope, this is not because of either EIO or EINVAL. It seems to be caused by ENOTCONN.
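The ENOTCONN reading lines up with the qemu log in comment #3: the error code 107 reported by the block layer is ENOTCONN on Linux, and its message text is exactly the "Transport endpoint is not connected" string seen there. A quick check (assumes a Linux host, where errno values match the log):

```python
import errno
import os

# Errno 107 from the qemu block I/O errors is ENOTCONN on Linux.
code = errno.ENOTCONN
message = os.strerror(code)
print(code, message)
```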
Hi Ravi,

Following is what I did to verify the behaviour on a native XFS mount. I wrote a small python script as below and ran the command ./godown /mnt/fio_test to crash XFS. The python script failed with an input/output error. Once the remount happens, the script does not continue writing to the file.

f = open('/mnt/fio_test/test.txt', 'a')
x = 1
while True:
    f.write("To infinity and beyond! We're getting close, on %d now!" % (x))
    x += 1

Thanks
kasturi
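A side note on why a remount alone cannot revive a writer: the process holds a file descriptor opened before the filesystem went away, and bringing the mount back does not repair that descriptor. The effect can be sketched with a plain file, using os.close() to stand in for the dying mount, so EBADF below plays the role that ENOTCONN plays in the real logs. This is an illustration of the stale-descriptor behaviour, not the gluster or XFS code path:

```python
import errno
import os
import tempfile

# Create a stand-in "disk image" and open it the way a long-running writer would.
path = os.path.join(tempfile.mkdtemp(), "disk.img")
fd = os.open(path, os.O_WRONLY | os.O_CREAT)
os.write(fd, b"guest data")  # writes succeed while the backing store is alive

# Simulate the mount going away: the descriptor the writer holds goes bad.
os.close(fd)

# A remount would hand out *new* descriptors; the old one the running
# process still holds stays broken, so its writes keep failing.
err = None
try:
    os.write(fd, b"more guest data")
except OSError as e:
    err = e.errno
print(errno.errorcode[err])
```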
Thanks for the confirmation, Kasturi. Closing the BZ, as this seems to be expected behaviour even on on-disk file systems, based on comment #10.