Description of problem:

VM state is 'Paused' after messing with vdsm on a host, even though the VM is effectively Up. It is odd. How did that happen? It is a side effect of BZ1131545: stopping supervdsmd/vdsmd and waiting a while... but the VM was running all the time and my SSH connection into the VM was working all the time. The engine incorrectly detects the VM state.

Version-Release number of selected component (if applicable):
ovirt-engine-backend-3.5.0-0.0.master.20140804172041.git23b558e.el6.noarch

How reproducible:
100%

Steps to Reproduce:
1. stop supervdsmd/vdsmd on a host while a VM is running
2. wait
3.

Actual results:
After a while the VM status changes to 'Paused', but the VM is effectively still up and working OK.

Expected results:
The engine should show reality.

Additional info:
Note that the qemu-kvm process is "older" than the vdsm processes started by SSH soft fencing...

# ps aux | egrep '[v]dsmd|[q]emu-kvm'
qemu 2937 1.1 5.2 1810716 424208 ? Sl 16:00 0:24 /usr/libexec/qemu-kvm -name jb-rhel7 -S -M rhel6.5.0 -cpu SandyBridge -enable-kvm -m 1024 -realtime mlock=off -smp 1,maxcpus=160,sockets=160,cores=1,threads=1 -uuid f8262f1c-71d3-4b0a-b14d-3bb060d89c61 -smbios type=1,manufacturer=oVirt,product=oVirt Node,version=6Server-6.5.0.1.el6,serial=4C4C4544-0058-3410-8058-C2C04F38354A,uuid=f8262f1c-71d3-4b0a-b14d-3bb060d89c61 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/jb-rhel7.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2014-08-19T14:00:34,driftfix=slew -no-kvm-pit-reinjection -no-shutdown -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x5.0x7 -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x5.0x2 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x5 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x5.0x1 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x6 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x7 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/00000002-0002-0002-0002-000000000038/f3298d58-7f41-4d71-8a06-560a69451ab0/images/8e03de77-e237-49ca-b8ac-46cf283e78b7/c6ca3708-8adb-4d7e-8f88-a81761539dcf,if=none,id=drive-virtio-disk0,format=qcow2,serial=8e03de77-e237-49ca-b8ac-46cf283e78b7,cache=none,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x8,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=28,id=hostnet0,vhost=on,vhostfd=29 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:32:a5:00,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/f8262f1c-71d3-4b0a-b14d-3bb060d89c61.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/f8262f1c-71d3-4b0a-b14d-3bb060d89c61.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice port=5900,tls-port=5901,addr=10.34.63.223,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -k en-us -vga qxl -global qxl-vga.ram_size=67108864 -global qxl-vga.vram_size=33554432 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -chardev spicevmc,id=charredir0,name=usbredir -device usb-redir,chardev=charredir0,id=redir0 -chardev spicevmc,id=charredir1,name=usbredir -device usb-redir,chardev=charredir1,id=redir1 -chardev spicevmc,id=charredir2,name=usbredir -device usb-redir,chardev=charredir2,id=redir2 -chardev spicevmc,id=charredir3,name=usbredir -device usb-redir,chardev=charredir3,id=redir3 -incoming tcp:[::]:49153 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x9
root 3229 0.0 0.0 11340 688 ? S< 16:02 0:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/supervdsm_respawn.pid /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid
root 3232 0.0 0.3 654280 28548 ? S<l 16:02 0:00 /usr/bin/python /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid
vdsm 3326 0.0 0.0 9236 680 ? S< 16:03 0:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/respawn.pid /usr/share/vdsm/vdsm --pidfile /var/run/vdsm/vdsmd.pid
vdsm 3329 4.5 0.7 2871756 62012 ? S<l 16:03 1:32 /usr/bin/python /usr/share/vdsm/vdsm --pidfile /var/run/vdsm/vdsmd.pid
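For completeness, a rough, untested Python sketch of the reproduction steps above (it assumes a host with one running VM, SysV/systemd-managed supervdsmd/vdsmd services, and virsh available, as in the transcripts in the later comments); it only illustrates the sequence of stopping both daemons and waiting before checking what libvirt still reports:

import subprocess
import time

# Stop both daemons on the host while the VM keeps running under qemu-kvm.
for unit in ("supervdsmd", "vdsmd"):
    subprocess.check_call(["service", unit, "stop"])

# "wait" -- long enough for the engine to declare the host non-responsive
# and mark the VM 'Unknown'.
time.sleep(600)

# libvirt still reports the guest as running, even when the Admin Portal
# later shows it as 'Paused'.
print(subprocess.check_output(["virsh", "-r", "list"]).decode())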
Created attachment 928390 [details] engine.log
Created attachment 928391 [details] vdsm.log
Clicking the 'Run' button in the Admin Portal makes the VM state reflect reality, and opening a console works.
Looks like an issue with vdsm recovery? I was able to reproduce the scenario with just a stop and start of the vdsm service:

[root@dubi ~]# vdsClient 0 list table
f1f7eba7-4c11-4de5-b69a-e2ed1e98f672  20182  v1                   Up
[root@dubi ~]# service vdsmd stop
Redirecting to /bin/systemctl stop vdsmd.service
[root@dubi ~]# service vdsmd start
Redirecting to /bin/systemctl start vdsmd.service
[root@dubi ~]# vdsClient 0 list table
f1f7eba7-4c11-4de5-b69a-e2ed1e98f672  20182  v1                   Paused
[root@dubi ~]# virsh -r list
 Id    Name                           State
----------------------------------------------------
 2     v1                             running
There is something fishy going on here. First, it is not that bad: it is not 100% reproducible every time one restarts VDSM while VMs are running. The following is from a RHEL7 host of mine:

GENji> 13:19:44 root [~]$ vdsClient 0 list table
b63bfabf-6be6-4273-b133-3f8a068f52ac   5325  SuperTiny_C0         Up
GENji> 13:19:46 root [~]$ service vdsmd stop
Redirecting to /bin/systemctl stop vdsmd.service
GENji> 13:20:01 root [~]$ service vdsmd start
Redirecting to /bin/systemctl start vdsmd.service
GENji> 13:20:07 root [~]$ vdsClient 0 list table
b63bfabf-6be6-4273-b133-3f8a068f52ac   5325  SuperTiny_C0         Up
GENji> 13:20:12 root [~]$ virsh -r list
 Id    Name                           State
----------------------------------------------------
 2     SuperTiny_C0                   running

I posted http://gerrit.ovirt.org/#/c/31697/1 to help understand this issue. We need a clearer reproducible case to go ahead. Still investigating.
The previous comment was using VDSM master. Using VDSM 4.16.1-git:

$ grep 'I am' /var/log/vdsm//vdsm.log
MainThread::INFO::2014-08-20 13:40:57,827::vdsm::131::vds::(run) (PID: 22719) I am the actual vdsm 4.16.1-32.git2ea9170.el7 genji.rokugan.lan (3.10.0-123.el7.x86_64)

Same result:

GENji> 13:41:47 root [~]$ vdsClient 0 list table
56d1c657-dd76-4609-a207-c050699be5be  23247  F20_C1               Up
GENji> 13:41:49 root [~]$ service vdsmd stop
Redirecting to /bin/systemctl stop vdsmd.service
GENji> 13:42:07 root [~]$ vdsClient 0 list table
Connection to 0.0.0.0:54321 refused
GENji> 13:42:10 root [~]$ virsh -r list
 Id    Name                           State
----------------------------------------------------
 3     F20_C1                         running
GENji> 13:42:15 root [~]$ service vdsmd start
Redirecting to /bin/systemctl start vdsmd.service
GENji> 13:42:21 root [~]$ vdsClient 0 list table
56d1c657-dd76-4609-a207-c050699be5be  23247  F20_C1               Up
GENji> 13:42:53 root [~]$ virsh -r list
 Id    Name                           State
----------------------------------------------------
 3     F20_C1                         running
GENji> 13:42:56 root [~]$ sleep 5m && vdsClient 0 list table
56d1c657-dd76-4609-a207-c050699be5be  23247  F20_C1               Up
GENji> 13:48:08 root [~]$

I'll keep this BZ open to track this issue, but we definitely need more data.
FYI, I stated that you have to stop both supervdsmd and vdsmd and *wait*... I am not sure when the VM was detected/set to 'Paused'; see the logs.
(In reply to Jiri Belka from comment #7)
> FYI, I stated that you have to stop both supervdsmd and vdsmd and *wait*...
> I am not sure when the VM was detected/set to 'Paused'; see the logs.

Fair enough, I can easily try it this way, so I'll do that.
vdsm-4.16.1-6.gita4a4614.el6.x86_64 on the host (10.34.63.223)
*** Bug 1131567 has been marked as a duplicate of this bug. ***
The recovery issue Omer reported in https://bugzilla.redhat.com/show_bug.cgi?id=1131548#c6 was fixed by http://gerrit.ovirt.org/#/c/31047/, which should appear in VDSM 4.16.2. So, no action needed there.

As for the issue Jiri originally reported, I just realized that if VDSM *stays* down and is not just restarted, then the Engine comes into play and may decide to place the VM in Paused (or Unknown, or something else). I see some suspect lines in the Engine logs:

2014-08-19 16:03:05,340 DEBUG [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] (DefaultQuartzScheduler_Worker-7) [79166c6] Permission check skipped for internal action HandleVdsCpuFlagsOrClusterChanged.
2014-08-19 16:03:05,343 INFO [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] (DefaultQuartzScheduler_Worker-7) [79166c6] Running command: HandleVdsCpuFlagsOrClusterChangedCommand(RunSilent = false, VdsId = 427d5161-0ec8-442c-9377-6cfe0588c642) internal: true. Entities affected : ID: 427d5161-0ec8-442c-9377-6cfe0588c642 Type: VDS
2014-08-19 16:03:05,341 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-53) [51e5d497] Successed acquiring lock EngineLock [exclusiveLocks= key: 396ab187-c838-42dc-94f2-85de7221b874 value: VDS_INIT , sharedLocks= ] succeeded
2014-08-19 16:03:05,358 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-7) [79166c6] VM jb-rhel7 f8262f1c-71d3-4b0a-b14d-3bb060d89c61 moved from Unknown --> Paused

and

2014-08-19 14:32:16,334 ERROR [org.ovirt.engine.core.vdsbroker.ResumeVDSCommand] (org.ovirt.thread.pool-8-thread-10) [16708786] VDS::pause Failed resume vm f8262f1c-71d3-4b0a-b14d-3bb060d89c61 in vds = 396ab187-c838-42dc-94f2-85de7221b874 : dell-r210ii-03, error = org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to ResumeBrokerVDS, error = Unexpected exception, code = 16
2014-08-19 14:32:16,335 INFO [org.ovirt.engine.core.vdsbroker.ResumeVDSCommand] (org.ovirt.thread.pool-8-thread-10) [16708786] FINISH, ResumeVDSCommand, return: Unknown, log id: 55985742
2014-08-19 14:32:16,336 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (org.ovirt.thread.pool-8-thread-10) [16708786] Before releasing a lock EngineLock [exclusiveLocks= key: f8262f1c-71d3-4b0a-b14d-3bb060d89c61 value: VM , sharedLocks= ]
2014-08-19 14:32:16,337 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (org.ovirt.thread.pool-8-thread-10) [16708786] The exclusive lock for key f8262f1c-71d3-4b0a-b14d-3bb060d89c61VM is released and lock is removed from map
2014-08-19 14:32:16,338 INFO [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-8-thread-10) [16708786] Lock freed to object EngineLock [exclusiveLocks= key: f8262f1c-71d3-4b0a-b14d-3bb060d89c61 value: VM , sharedLocks= ]
2014-08-19 14:32:16,339 ERROR [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-8-thread-10) [16708786] Command org.ovirt.engine.core.bll.RunVmCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to ResumeBrokerVDS, error = Unexpected exception, code = 16 (Failed with error unexpected and code 16)
2014-08-19 14:32:16,341 DEBUG [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-8-thread-10) [16708786] Command org.ovirt.engine.core.bll.RunVmCommand throw Vdc Bll exception: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to ResumeBrokerVDS, error = Unexpected exception, code = 16 (Failed with error unexpected and code 16)
    at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:117) [bll.jar:]
    at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [bll.jar:]
    at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunAsyncVdsCommand(VDSBrokerFrontendImpl.java:48) [bll.jar:]
    at org.ovirt.engine.core.bll.RunVmCommand.resumeVm(RunVmCommand.java:207) [bll.jar:]
    at org.ovirt.engine.core.bll.RunVmCommand.perform(RunVmCommand.java:332) [bll.jar:]
    at org.ovirt.engine.core.bll.RunVmCommand.executeVmCommand(RunVmCommand.java:274) [bll.jar:]
    at org.ovirt.engine.core.bll.VmCommand.executeCommand(VmCommand.java:99) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1135) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1254) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1874) [bll.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:174) [utils.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:116) [utils.jar:]
    at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1275) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:345) [bll.jar:]
    at org.ovirt.engine.core.bll.MultipleActionsRunner.executeValidatedCommand(MultipleActionsRunner.java:191) [bll.jar:]
    at org.ovirt.engine.core.bll.MultipleActionsRunner.runCommands(MultipleActionsRunner.java:158) [bll.jar:]
    at org.ovirt.engine.core.bll.SortedMultipleActionsRunnerBase.runCommands(SortedMultipleActionsRunnerBase.java:20) [bll.jar:]
    at org.ovirt.engine.core.bll.MultipleActionsRunner$2.run(MultipleActionsRunner.java:167) [bll.jar:]
    at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:90) [utils.jar:]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [rt.jar:1.7.0_55]
    at java.util.concurrent.FutureTask.run(FutureTask.java:262) [rt.jar:1.7.0_55]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_55]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_55]
    at java.lang.Thread.run(Thread.java:744) [rt.jar:1.7.0_55]

Omer, does the above make sense to you?
When vdsm is stopped, it is treated as non-responsive from the engine's point of view, so when no power management is defined, the VMs are marked as 'Unknown'. Once vdsm responds again, we get the new status; this is when the engine gets 'Paused' and you see in the log: '...moved from Unknown --> Paused'. After that, when the user tries to start the paused VM, vdsm returns an error ("Failed with error unexpected and code 16"), but then the VM returns to Up. So I believe it is all the same case, and there is no other issue here.
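(Not actual engine code, just a rough Python sketch of the sequence described above, assuming no power management is configured for the host:)

def engine_vm_status(vdsm_reachable, reported_status):
    # Host is non-responsive and cannot be fenced -> the VM goes to 'Unknown'.
    if not vdsm_reachable:
        return "Unknown"
    # Once vdsm answers again, the engine trusts whatever status it reports,
    # even though qemu kept the guest running the whole time.
    return reported_status

print(engine_vm_status(False, None))      # Unknown (vdsm stopped, no fencing)
print(engine_vm_status(True, "Paused"))   # Paused (wrong status after vdsm recovery)
print(engine_vm_status(True, "Up"))       # Up (after clicking 'Run' / resume)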
(In reply to Omer Frenkel from comment #12)
> So I believe it is all the same case, and there is no other issue here.

I now believe so as well, since I found this in the VDSM logs Jiri provided:

Thread-13::DEBUG::2014-08-19 15:34:23,100::vm::1357::vm.Vm::(blockDev) vmId=`f8262f1c-71d3-4b0a-b14d-3bb060d89c61`::Unable to determine if the path '/rhev/data-center/00000002-0002-0002-0002-000000000038/f3298d58-7f41-4d71-8a06-560a69451ab0/images/8e03de77-e237-49ca-b8ac-46cf283e78b7/c6ca3708-8adb-4d7e-8f88-a81761539dcf' is a block device
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 1354, in blockDev
    self._blockDev = utils.isBlockDevice(self.path)
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 104, in isBlockDevice
    return stat.S_ISBLK(os.stat(path).st_mode)
OSError: [Errno 2] No such file or directory: '/rhev/data-center/00000002-0002-0002-0002-000000000038/f3298d58-7f41-4d71-8a06-560a69451ab0/images/8e03de77-e237-49ca-b8ac-46cf283e78b7/c6ca3708-8adb-4d7e-8f88-a81761539dcf'
Thread-13::INFO::2014-08-19 15:34:23,101::vm::2251::vm.Vm::(_startUnderlyingVm) vmId=`f8262f1c-71d3-4b0a-b14d-3bb060d89c61`::Skipping errors on recovery
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 2235, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/virt/vm.py", line 3298, in _run
    self._domDependentInit()
  File "/usr/share/vdsm/virt/vm.py", line 3189, in _domDependentInit
    self._syncVolumeChain(drive)
  File "/usr/share/vdsm/virt/vm.py", line 5660, in _syncVolumeChain
    volumes = self._driveGetActualVolumeChain(drive)
  File "/usr/share/vdsm/virt/vm.py", line 5639, in _driveGetActualVolumeChain
    sourceAttr = ('file', 'dev')[drive.blockDev]
TypeError: tuple indices must be integers, not NoneType
Thread-13::DEBUG::2014-08-19 15:34:23,101::vm::2254::vm.Vm::(_startUnderlyingVm) vmId=`f8262f1c-71d3-4b0a-b14d-3bb060d89c61`::_ongoingCreations released

So it looks like the same issue, all fixed by http://gerrit.ovirt.org/#/c/31047/, which will be in VDSM 4.16.2. Moving to ON_QA.
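(For anyone puzzled by the traceback above: a tiny self-contained snippet, not taken from vdsm, showing why the recovery path blows up when blockDev is still None:)

# blockDev is normally True or False, but stays None when isBlockDevice()
# failed with "No such file or directory" during recovery.
def drive_source_attr(block_dev):
    return ('file', 'dev')[block_dev]

print(drive_source_attr(False))  # 'file' -- file-backed disk
print(drive_source_attr(True))   # 'dev'  -- block device
try:
    drive_source_attr(None)      # the recovery case from the log
except TypeError as exc:
    print(exc)                   # tuple indices must be integers, not NoneType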
*** This bug has been marked as a duplicate of bug 1124970 ***