Bug 1131548
Summary: | VM state is 'Paused' after messing with vdsm on a host eventhe VM is effectively Up | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | [Retired] oVirt | Reporter: | Jiri Belka <jbelka> | ||||||
Component: | vdsm | Assignee: | Francesco Romani <fromani> | ||||||
Status: | CLOSED DUPLICATE | QA Contact: | Gil Klein <gklein> | ||||||
Severity: | medium | Docs Contact: | |||||||
Priority: | unspecified | ||||||||
Version: | 3.5 | CC: | acathrow, bazulay, ecohen, gklein, iheim, mgoldboi, ofrenkel, vfeenstr, yeylon | ||||||
Target Milestone: | --- | ||||||||
Target Release: | 3.5.0 | ||||||||
Hardware: | Unspecified | ||||||||
OS: | Unspecified | ||||||||
Whiteboard: | virt | ||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2014-08-20 13:54:27 UTC | Type: | Bug | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Attachments: |
|
Description
Jiri Belka
2014-08-19 14:38:01 UTC
Created attachment 928390 [details]
engine.log
Created attachment 928391 [details]
vdsm.log
clicking on 'Run' button in Admin Portal makes VM state to reflect reality and console opening works. looks like an issue with vdsm recovery? i was able to get the scenario with just start and stop to vdsm service: [root@dubi ~]# vdsClient 0 list table f1f7eba7-4c11-4de5-b69a-e2ed1e98f672 20182 v1 Up [root@dubi ~]# service vdsmd stop Redirecting to /bin/systemctl stop vdsmd.service [root@dubi ~]# service vdsmd start Redirecting to /bin/systemctl start vdsmd.service [root@dubi ~]# vdsClient 0 list table f1f7eba7-4c11-4de5-b69a-e2ed1e98f672 20182 v1 Paused [root@dubi ~]# virsh -r list Id Name State ---------------------------------------------------- 2 v1 running There is something fishy going on here. First thing, is not that bad: not 100% reproduceable all the time one restart VDSM while VM are running. The following on a RHEL7 host of mine: GENji> 13:19:44 root [~]$ vdsClient 0 list table b63bfabf-6be6-4273-b133-3f8a068f52ac 5325 SuperTiny_C0 Up GENji> 13:19:46 root [~]$ service vdsmd stop Redirecting to /bin/systemctl stop vdsmd.service GENji> 13:20:01 root [~]$ service vdsmd start Redirecting to /bin/systemctl start vdsmd.service GENji> 13:20:07 root [~]$ vdsClient 0 list table b63bfabf-6be6-4273-b133-3f8a068f52ac 5325 SuperTiny_C0 Up GENji> 13:20:12 root [~]$ virsh -r list Id Name State ---------------------------------------------------- 2 SuperTiny_C0 running I posted http://gerrit.ovirt.org/#/c/31697/1 tho help understanding this issue. We need a clear(er) reproduceable case to go ahead. Still investigating. The previous comment was using VDSM master. Using VDSM 4.16.1-git: $ grep 'I am' /var/log/vdsm//vdsm.log MainThread::INFO::2014-08-20 13:40:57,827::vdsm::131::vds::(run) (PID: 22719) I am the actual vdsm 4.16.1-32.git2ea9170.el7 genji.rokugan.lan (3.10.0-123.el7.x86_64) same result GENji> 13:41:47 root [~]$ vdsClient 0 list table 56d1c657-dd76-4609-a207-c050699be5be 23247 F20_C1 Up GENji> 13:41:49 root [~]$ service vdsmd stop Redirecting to /bin/systemctl stop vdsmd.service GENji> 13:42:07 root [~]$ vdsClient 0 list table Connection to 0.0.0.0:54321 refused GENji> 13:42:10 root [~]$ virsh -r list Id Name State ---------------------------------------------------- 3 F20_C1 running GENji> 13:42:15 root [~]$ service vdsmd start Redirecting to /bin/systemctl start vdsmd.service GENji> 13:42:21 root [~]$ vdsClient 0 list table 56d1c657-dd76-4609-a207-c050699be5be 23247 F20_C1 Up GENji> 13:42:53 root [~]$ virsh -r list Id Name State ---------------------------------------------------- 3 F20_C1 running GENji> 13:42:56 root [~]$ sleep 5m && vdsClient 0 list table 56d1c657-dd76-4609-a207-c050699be5be 23247 F20_C1 Up GENji> 13:48:08 root [~]$ I'll keep this BZ open to track this issue, but we definitely need more data. FYI I stated that you have to stop both supervdsmd and vdsmd and *wait*... not sure when it was discovered/set to 'Paused', see logs. (In reply to Jiri Belka from comment #7) > FYI I stated that you have to stop both supervdsmd and vdsmd and *wait*... > not sure when it was discovered/set to 'Paused', see logs. Fair enough, I can easily try this way, so I'll do that vdsm-4.16.1-6.gita4a4614.el6.x86_64 on the host (10.34.63.223) *** Bug 1131567 has been marked as a duplicate of this bug. *** The recovery Issue Omer reported into https://bugzilla.redhat.com/show_bug.cgi?id=1131548#c6 was fixed into http://gerrit.ovirt.org/#/c/31047/ which should appear into VDSM 4.16.2. So, no action needed. As for the issue Jiri originally reported, I just realized that if the VDSM *stays* down and it is not just restared, then Engine comes into play and may decide to place the VM in Paused (or in Unknown or something else). I see some suspect lines into the Engine logs: 2014-08-19 16:03:05,340 DEBUG [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] (DefaultQuartzScheduler_Worker-7) [79166c6] Permission check skipped for internal action HandleVdsCpuFlagsOrClusterChanged. 2014-08-19 16:03:05,343 INFO [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] (DefaultQuartzScheduler_Worker-7) [79166c6] Running command: HandleVdsCpuFlagsOrClusterChangedCommand(RunSilent = false, VdsId = 427d5161-0ec8-442c-9377-6cfe0588c642) internal: true. Entities affected : ID: 427d5161-0ec8-442c-9377-6cfe0588c642 Type: VDS 2014-08-19 16:03:05,341 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-53) [51e5d497] Successed acquiring lock EngineLock [exclusiveLocks= key: 396ab187-c838-42dc-94f2-85de7221b874 value: VDS_INIT , sharedLocks= ] succeeded 2014-08-19 16:03:05,358 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-7) [79166c6] VM jb-rhel7 f8262f1c-71d3-4b0a-b14d-3bb060d89c61 moved from Unknown --> Paused and 2014-08-19 14:32:16,334 ERROR [org.ovirt.engine.core.vdsbroker.ResumeVDSCommand] (org.ovirt.thread.pool-8-thread-10) [16708786] VDS::pause Failed resume vm f8262f1c-71d3-4b0a-b14d-3bb060d89c61 in vds = 396ab187-c838-42dc-94f2-85de7221b874 : dell-r210ii-03, error = org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to ResumeBrokerVDS, error = Unexpected exception, code = 16 2014-08-19 14:32:16,335 INFO [org.ovirt.engine.core.vdsbroker.ResumeVDSCommand] (org.ovirt.thread.pool-8-thread-10) [16708786] FINISH, ResumeVDSCommand, return: Unknown, log id: 55985742 2014-08-19 14:32:16,336 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (org.ovirt.thread.pool-8-thread-10) [16708786] Before releasing a lock EngineLock [exclusiveLocks= key: f8262f1c-71d3-4b0a-b14d-3bb060d89c61 value: VM , sharedLocks= ] 2014-08-19 14:32:16,337 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (org.ovirt.thread.pool-8-thread-10) [16708786] The exclusive lock for key f8262f1c-71d3-4b0a-b14d-3bb060d89c61VM is released and lock is removed from map 2014-08-19 14:32:16,338 INFO [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-8-thread-10) [16708786] Lock freed to object EngineLock [exclusiveLocks= key: f8262f1c-71d3-4b0a-b14d-3bb060d89c61 value: VM , sharedLocks= ] 2014-08-19 14:32:16,339 ERROR [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-8-thread-10) [16708786] Command org.ovirt.engine.core.bll.RunVmCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to ResumeBrokerVDS, error = Unexpected exception, code = 16 (Failed with error unexpected and code 16) 2014-08-19 14:32:16,341 DEBUG [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-8-thread-10) [16708786] Command org.ovirt.engine.core.bll.RunVmCommand throw Vdc Bll exception: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to ResumeBrokerVDS, error = Unexpected exception, code = 16 (Failed with error unexpected and code 16) at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:117) [bll.jar:] at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [bll.jar:] at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunAsyncVdsCommand(VDSBrokerFrontendImpl.java:48) [bll.jar:] at org.ovirt.engine.core.bll.RunVmCommand.resumeVm(RunVmCommand.java:207) [bll.jar:] at org.ovirt.engine.core.bll.RunVmCommand.perform(RunVmCommand.java:332) [bll.jar:] at org.ovirt.engine.core.bll.RunVmCommand.executeVmCommand(RunVmCommand.java:274) [bll.jar:] at org.ovirt.engine.core.bll.VmCommand.executeCommand(VmCommand.java:99) [bll.jar:] at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1135) [bll.jar:] at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1254) [bll.jar:] at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1874) [bll.jar:] at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:174) [utils.jar:] at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:116) [utils.jar:] at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1275) [bll.jar:] at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:345) [bll.jar:] at org.ovirt.engine.core.bll.MultipleActionsRunner.executeValidatedCommand(MultipleActionsRunner.java:191) [bll.jar:] at org.ovirt.engine.core.bll.MultipleActionsRunner.runCommands(MultipleActionsRunner.java:158) [bll.jar:] at org.ovirt.engine.core.bll.SortedMultipleActionsRunnerBase.runCommands(SortedMultipleActionsRunnerBase.java:20) [bll.jar:] at org.ovirt.engine.core.bll.MultipleActionsRunner$2.run(MultipleActionsRunner.java:167) [bll.jar:] at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:90) [utils.jar:] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [rt.jar:1.7.0_55] at java.util.concurrent.FutureTask.run(FutureTask.java:262) [rt.jar:1.7.0_55] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_55] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_55] at java.lang.Thread.run(Thread.java:744) [rt.jar:1.7.0_55] Omer, do you think the above sentences makes sense? when vdsm is stopped, it is treated as non-resposive from the engine point-of-view, so when no power-mgmt defiend, the vms are marked as 'unknown'. once vdsm responds again, we get the new status, this is when engine get 'paused' and you see in the log: '...moved from Unknown --> Paused' after that, when user try to start the paused vm, vdsm returns error (" Failed with error unexpected and code 16") but then the vm returns to up. so i believe its all the same case, and there is no other issue here (In reply to Omer Frenkel from comment #12) > so i believe its all the same case, and there is no other issue here I now believe so as well since I found this on VDSM logs Jiri provided Thread-13::DEBUG::2014-08-19 15:34:23,100::vm::1357::vm.Vm::(blockDev) vmId=`f8262f1c-71d3-4b0a-b14d-3bb060d89c61`::Unable to determine if the path '/rhev/data-center/00000002-0002-0002-0002-000000000038/f3298d58-7f41-4d71-8a06-560a69451ab0/images/8e03de77-e237-49ca-b8ac-46cf283e78b7/c6ca3708-8adb-4d7e-8f88-a81761539dcf' is a block device Traceback (most recent call last): File "/usr/share/vdsm/virt/vm.py", line 1354, in blockDev self._blockDev = utils.isBlockDevice(self.path) File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 104, in isBlockDevice return stat.S_ISBLK(os.stat(path).st_mode) OSError: [Errno 2] No such file or directory: '/rhev/data-center/00000002-0002-0002-0002-000000000038/f3298d58-7f41-4d71-8a06-560a69451ab0/images/8e03de77-e237-49ca-b8ac-46cf283e78b7/c6ca3708-8adb-4d7e-8f88-a81761539dcf' Thread-13::INFO::2014-08-19 15:34:23,101::vm::2251::vm.Vm::(_startUnderlyingVm) vmId=`f8262f1c-71d3-4b0a-b14d-3bb060d89c61`::Skipping errors on recovery Traceback (most recent call last): File "/usr/share/vdsm/virt/vm.py", line 2235, in _startUnderlyingVm self._run() File "/usr/share/vdsm/virt/vm.py", line 3298, in _run self._domDependentInit() File "/usr/share/vdsm/virt/vm.py", line 3189, in _domDependentInit self._syncVolumeChain(drive) File "/usr/share/vdsm/virt/vm.py", line 5660, in _syncVolumeChain volumes = self._driveGetActualVolumeChain(drive) File "/usr/share/vdsm/virt/vm.py", line 5639, in _driveGetActualVolumeChain sourceAttr = ('file', 'dev')[drive.blockDev] TypeError: tuple indices must be integers, not NoneType Thread-13::DEBUG::2014-08-19 15:34:23,101::vm::2254::vm.Vm::(_startUnderlyingVm) vmId=`f8262f1c-71d3-4b0a-b14d-3bb060d89c61`::_ongoingCreations released then looks like the same issue, all fixed by http://gerrit.ovirt.org/#/c/31047/ which will be on VDSM 4.16.2. Moving to ON_QA *** This bug has been marked as a duplicate of bug 1124970 *** |