Bug 1131548

Summary:

VM state is 'Paused' after messing with vdsm on a host eventhe VM is effectively Up

Product:

[Retired] oVirt

Reporter:

Jiri Belka <jbelka>

Component:

vdsm

Assignee:

Francesco Romani <fromani>

Status:

CLOSED DUPLICATE

QA Contact:

Gil Klein <gklein>

Severity:

medium

Docs Contact:

Priority:

unspecified

Version:

3.5

CC:

acathrow, bazulay, ecohen, gklein, iheim, mgoldboi, ofrenkel, vfeenstr, yeylon

Target Milestone:

---

Target Release:

3.5.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

virt

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2014-08-20 13:54:27 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

Virt

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
engine.log	none
vdsm.log	none

Description Jiri Belka 2014-08-19 14:38:01 UTC

Description of problem:
VM state is 'Paused' after messing with vdsm on a host eventhe VM is effectively Up. It is odd.

How did that happened? It's side-effect of BZ1131545. Stopping supervdsmd/vdsdm and waiting a while... but the VM was running all the time and my SSH connection into VM was working all the time. Engine incorrectly detects the VM state.

Version-Release number of selected component (if applicable):
ovirt-engine-backend-3.5.0-0.0.master.20140804172041.git23b558e.el6.noarch

How reproducible:
100%

Steps to Reproduce:
1. stop supervdsmd/vdsmd on a host while having a VM running
2. wait
3.

Actual results:
VM status changes to 'Paused' in a while but the VM is effectively still up and working OK

Expected results:
show reality

Additional info:

see the qemu-kvm process is "older" than vdsm stuff executed by ssh soft fencing...

# ps aux | egrep '[v]dsmd|[q]emu-kvm'
qemu      2937  1.1  5.2 1810716 424208 ?      Sl   16:00   0:24 /usr/libexec/qemu-kvm -name jb-rhel7 -S -M rhel6.5.0 -cpu SandyBridge -enable-kvm -m 1024 -realtime mlock=off -smp 1,maxcpus=160,sockets=160,cores=1,threads=1 -uuid f8262f1c-71d3-4b0a-b14d-3bb060d89c61 -smbios type=1,manufacturer=oVirt,product=oVirt Node,version=6Server-6.5.0.1.el6,serial=4C4C4544-0058-3410-8058-C2C04F38354A,uuid=f8262f1c-71d3-4b0a-b14d-3bb060d89c61 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/jb-rhel7.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2014-08-19T14:00:34,driftfix=slew -no-kvm-pit-reinjection -no-shutdown -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x5.0x7 -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x5.0x2 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x5 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x5.0x1 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x6 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x7 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/00000002-0002-0002-0002-000000000038/f3298d58-7f41-4d71-8a06-560a69451ab0/images/8e03de77-e237-49ca-b8ac-46cf283e78b7/c6ca3708-8adb-4d7e-8f88-a81761539dcf,if=none,id=drive-virtio-disk0,format=qcow2,serial=8e03de77-e237-49ca-b8ac-46cf283e78b7,cache=none,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x8,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=28,id=hostnet0,vhost=on,vhostfd=29 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:32:a5:00,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/f8262f1c-71d3-4b0a-b14d-3bb060d89c61.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/f8262f1c-71d3-4b0a-b14d-3bb060d89c61.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice port=5900,tls-port=5901,addr=10.34.63.223,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -k en-us -vga qxl -global qxl-vga.ram_size=67108864 -global qxl-vga.vram_size=33554432 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -chardev spicevmc,id=charredir0,name=usbredir -device usb-redir,chardev=charredir0,id=redir0 -chardev spicevmc,id=charredir1,name=usbredir -device usb-redir,chardev=charredir1,id=redir1 -chardev spicevmc,id=charredir2,name=usbredir -device usb-redir,chardev=charredir2,id=redir2 -chardev spicevmc,id=charredir3,name=usbredir -device usb-redir,chardev=charredir3,id=redir3 -incoming tcp:[::]:49153 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x9
root      3229  0.0  0.0  11340   688 ?        S<   16:02   0:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/supervdsm_respawn.pid /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid
root      3232  0.0  0.3 654280 28548 ?        S<l  16:02   0:00 /usr/bin/python /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid
vdsm      3326  0.0  0.0   9236   680 ?        S<   16:03   0:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/respawn.pid /usr/share/vdsm/vdsm --pidfile /var/run/vdsm/vdsmd.pid
vdsm      3329  4.5  0.7 2871756 62012 ?       S<l  16:03   1:32 /usr/bin/python /usr/share/vdsm/vdsm --pidfile /var/run/vdsm/vdsmd.pid

Comment 1 Jiri Belka 2014-08-19 14:38:45 UTC

Created attachment 928390 [details]
engine.log

Comment 2 Jiri Belka 2014-08-19 14:39:21 UTC

Created attachment 928391 [details]
vdsm.log

Comment 3 Jiri Belka 2014-08-19 14:40:18 UTC

clicking on 'Run' button in Admin Portal makes VM state to reflect reality and console opening works.

Comment 4 Omer Frenkel 2014-08-20 09:10:30 UTC

looks like an issue with vdsm recovery?
i was able to get the scenario with just start and stop to vdsm service:

[root@dubi ~]#  vdsClient 0 list table
f1f7eba7-4c11-4de5-b69a-e2ed1e98f672  20182  v1                   Up                                       
[root@dubi ~]# service vdsmd stop
Redirecting to /bin/systemctl stop  vdsmd.service
[root@dubi ~]# service vdsmd start
Redirecting to /bin/systemctl start  vdsmd.service
[root@dubi ~]#  vdsClient 0 list table
f1f7eba7-4c11-4de5-b69a-e2ed1e98f672  20182  v1                   Paused                                   
[root@dubi ~]# virsh -r list
 Id    Name                           State
----------------------------------------------------
 2     v1                             running

Comment 5 Francesco Romani 2014-08-20 11:23:28 UTC

There is something fishy going on here.

First thing, is not that bad: not 100% reproduceable all the time one restart VDSM while VM are running.

The following on a RHEL7 host of mine:

GENji> 13:19:44 root [~]$ vdsClient 0 list table
b63bfabf-6be6-4273-b133-3f8a068f52ac   5325  SuperTiny_C0         Up                                       
GENji> 13:19:46 root [~]$ service vdsmd stop
Redirecting to /bin/systemctl stop  vdsmd.service
GENji> 13:20:01 root [~]$ service vdsmd start
Redirecting to /bin/systemctl start  vdsmd.service
GENji> 13:20:07 root [~]$ vdsClient 0 list table
b63bfabf-6be6-4273-b133-3f8a068f52ac   5325  SuperTiny_C0         Up                                       
GENji> 13:20:12 root [~]$ virsh -r list
 Id    Name                           State
----------------------------------------------------
 2     SuperTiny_C0                   running

I posted http://gerrit.ovirt.org/#/c/31697/1 tho help understanding this issue.

We need a clear(er) reproduceable case to go ahead. Still investigating.

Comment 6 Francesco Romani 2014-08-20 11:49:48 UTC

The previous comment was using VDSM master. Using VDSM 4.16.1-git:

$ grep 'I am' /var/log/vdsm//vdsm.log
MainThread::INFO::2014-08-20 13:40:57,827::vdsm::131::vds::(run) (PID: 22719) I am the actual vdsm 4.16.1-32.git2ea9170.el7 genji.rokugan.lan (3.10.0-123.el7.x86_64)

same result

GENji> 13:41:47 root [~]$ vdsClient 0 list table
56d1c657-dd76-4609-a207-c050699be5be  23247  F20_C1               Up                                       
GENji> 13:41:49 root [~]$ service vdsmd stop
Redirecting to /bin/systemctl stop  vdsmd.service
GENji> 13:42:07 root [~]$ vdsClient 0 list table
Connection to 0.0.0.0:54321 refused
GENji> 13:42:10 root [~]$ virsh -r list
 Id    Name                           State
----------------------------------------------------
 3     F20_C1                         running

GENji> 13:42:15 root [~]$ service vdsmd start
Redirecting to /bin/systemctl start  vdsmd.service
GENji> 13:42:21 root [~]$ vdsClient 0 list table
56d1c657-dd76-4609-a207-c050699be5be  23247  F20_C1               Up                                       
GENji> 13:42:53 root [~]$ virsh -r list
 Id    Name                           State
----------------------------------------------------
 3     F20_C1                         running

GENji> 13:42:56 root [~]$ sleep 5m && vdsClient 0 list table
56d1c657-dd76-4609-a207-c050699be5be  23247  F20_C1               Up                                       
GENji> 13:48:08 root [~]$ 

I'll keep this BZ open to track this issue, but we definitely need more data.

Comment 7 Jiri Belka 2014-08-20 12:05:18 UTC

FYI I stated that you have to stop both supervdsmd and vdsmd and *wait*... not sure when it was discovered/set to 'Paused', see logs.

Comment 8 Francesco Romani 2014-08-20 12:10:11 UTC

(In reply to Jiri Belka from comment #7)
> FYI I stated that you have to stop both supervdsmd and vdsmd and *wait*...
> not sure when it was discovered/set to 'Paused', see logs.

Fair enough, I can easily try this way, so I'll do that

Comment 9 Jiri Belka 2014-08-20 12:25:24 UTC

vdsm-4.16.1-6.gita4a4614.el6.x86_64 on the host (10.34.63.223)

Comment 10 Vinzenz Feenstra [evilissimo] 2014-08-20 12:35:09 UTC

*** Bug 1131567 has been marked as a duplicate of this bug. ***

Comment 11 Francesco Romani 2014-08-20 12:44:56 UTC

The recovery Issue Omer reported into https://bugzilla.redhat.com/show_bug.cgi?id=1131548#c6

was fixed into http://gerrit.ovirt.org/#/c/31047/

which should appear into VDSM 4.16.2. So, no action needed.

As for the issue Jiri originally reported, I just realized that if the VDSM *stays* down and it is not just restared, then Engine comes into play and may decide to place the VM in Paused (or in Unknown or something else).

I see some suspect lines into the Engine logs:

2014-08-19 16:03:05,340 DEBUG [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] (DefaultQuartzScheduler_Worker-7) [79166c6] Permission check skipped for internal action HandleVdsCpuFlagsOrClusterChanged.
2014-08-19 16:03:05,343 INFO  [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] (DefaultQuartzScheduler_Worker-7) [79166c6] Running command: HandleVdsCpuFlagsOrClusterChangedCommand(RunSilent = false, VdsId = 427d5161-0ec8-442c-9377-6cfe0588c642) internal: true. Entities affected :  ID: 427d5161-0ec8-442c-9377-6cfe0588c642 Type: VDS
2014-08-19 16:03:05,341 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-53) [51e5d497] Successed acquiring lock EngineLock [exclusiveLocks= key: 396ab187-c838-42dc-94f2-85de7221b874 value: VDS_INIT
, sharedLocks= ] succeeded 
2014-08-19 16:03:05,358 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-7) [79166c6] VM jb-rhel7 f8262f1c-71d3-4b0a-b14d-3bb060d89c61 moved from Unknown --> Paused


and

2014-08-19 14:32:16,334 ERROR [org.ovirt.engine.core.vdsbroker.ResumeVDSCommand] (org.ovirt.thread.pool-8-thread-10) [16708786] VDS::pause Failed resume vm f8262f1c-71d3-4b0a-b14d-3bb060d89c61 in vds = 396ab187-c838-42dc-94f2-85de7221b874 : dell-r210ii-03, error = org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to ResumeBrokerVDS, error = Unexpected exception, code = 16
2014-08-19 14:32:16,335 INFO  [org.ovirt.engine.core.vdsbroker.ResumeVDSCommand] (org.ovirt.thread.pool-8-thread-10) [16708786] FINISH, ResumeVDSCommand, return: Unknown, log id: 55985742
2014-08-19 14:32:16,336 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (org.ovirt.thread.pool-8-thread-10) [16708786] Before releasing a lock EngineLock [exclusiveLocks= key: f8262f1c-71d3-4b0a-b14d-3bb060d89c61 value: VM
, sharedLocks= ]
2014-08-19 14:32:16,337 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (org.ovirt.thread.pool-8-thread-10) [16708786] The exclusive lock for key f8262f1c-71d3-4b0a-b14d-3bb060d89c61VM is released and lock is removed from map
2014-08-19 14:32:16,338 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-8-thread-10) [16708786] Lock freed to object EngineLock [exclusiveLocks= key: f8262f1c-71d3-4b0a-b14d-3bb060d89c61 value: VM
, sharedLocks= ]
2014-08-19 14:32:16,339 ERROR [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-8-thread-10) [16708786] Command org.ovirt.engine.core.bll.RunVmCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to ResumeBrokerVDS, error = Unexpected exception, code = 16 (Failed with error unexpected and code 16)
2014-08-19 14:32:16,341 DEBUG [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-8-thread-10) [16708786] Command org.ovirt.engine.core.bll.RunVmCommand throw Vdc Bll exception: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to ResumeBrokerVDS, error = Unexpected exception, code = 16 (Failed with error unexpected and code 16)
	at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:117) [bll.jar:]
	at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [bll.jar:]
	at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunAsyncVdsCommand(VDSBrokerFrontendImpl.java:48) [bll.jar:]
	at org.ovirt.engine.core.bll.RunVmCommand.resumeVm(RunVmCommand.java:207) [bll.jar:]
	at org.ovirt.engine.core.bll.RunVmCommand.perform(RunVmCommand.java:332) [bll.jar:]
	at org.ovirt.engine.core.bll.RunVmCommand.executeVmCommand(RunVmCommand.java:274) [bll.jar:]
	at org.ovirt.engine.core.bll.VmCommand.executeCommand(VmCommand.java:99) [bll.jar:]
	at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1135) [bll.jar:]
	at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1254) [bll.jar:]
	at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1874) [bll.jar:]
	at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:174) [utils.jar:]
	at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:116) [utils.jar:]
	at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1275) [bll.jar:]
	at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:345) [bll.jar:]
	at org.ovirt.engine.core.bll.MultipleActionsRunner.executeValidatedCommand(MultipleActionsRunner.java:191) [bll.jar:]
	at org.ovirt.engine.core.bll.MultipleActionsRunner.runCommands(MultipleActionsRunner.java:158) [bll.jar:]
	at org.ovirt.engine.core.bll.SortedMultipleActionsRunnerBase.runCommands(SortedMultipleActionsRunnerBase.java:20) [bll.jar:]
	at org.ovirt.engine.core.bll.MultipleActionsRunner$2.run(MultipleActionsRunner.java:167) [bll.jar:]
	at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:90) [utils.jar:]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [rt.jar:1.7.0_55]
	at java.util.concurrent.FutureTask.run(FutureTask.java:262) [rt.jar:1.7.0_55]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_55]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_55]
	at java.lang.Thread.run(Thread.java:744) [rt.jar:1.7.0_55]



Omer, do you think the above sentences makes sense?

Comment 12 Omer Frenkel 2014-08-20 13:22:11 UTC

when vdsm is stopped, it is treated as non-resposive from the engine point-of-view, so when no power-mgmt defiend, the vms are marked as 'unknown'.
once vdsm responds again, we get the new status, this is when engine get 'paused' and you see in the log: '...moved from Unknown --> Paused'

after that, when user try to start the paused vm, vdsm returns error ("
Failed with error unexpected and code 16") but then the vm returns to up.

so i believe its all the same case, and there is no other issue here

Comment 13 Francesco Romani 2014-08-20 13:51:06 UTC

(In reply to Omer Frenkel from comment #12)
> so i believe its all the same case, and there is no other issue here

I now believe so as well since I found this on VDSM logs Jiri provided

Thread-13::DEBUG::2014-08-19 15:34:23,100::vm::1357::vm.Vm::(blockDev) vmId=`f8262f1c-71d3-4b0a-b14d-3bb060d89c61`::Unable to determine if the path '/rhev/data-center/00000002-0002-0002-0002-000000000038/f3298d58-7f41-4d71-8a06-560a69451ab0/images/8e03de77-e237-49ca-b8ac-46cf283e78b7/c6ca3708-8adb-4d7e-8f88-a81761539dcf' is a block device
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 1354, in blockDev
    self._blockDev = utils.isBlockDevice(self.path)
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 104, in isBlockDevice
    return stat.S_ISBLK(os.stat(path).st_mode)
OSError: [Errno 2] No such file or directory: '/rhev/data-center/00000002-0002-0002-0002-000000000038/f3298d58-7f41-4d71-8a06-560a69451ab0/images/8e03de77-e237-49ca-b8ac-46cf283e78b7/c6ca3708-8adb-4d7e-8f88-a81761539dcf'
Thread-13::INFO::2014-08-19 15:34:23,101::vm::2251::vm.Vm::(_startUnderlyingVm) vmId=`f8262f1c-71d3-4b0a-b14d-3bb060d89c61`::Skipping errors on recovery
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 2235, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/virt/vm.py", line 3298, in _run
    self._domDependentInit()
  File "/usr/share/vdsm/virt/vm.py", line 3189, in _domDependentInit
    self._syncVolumeChain(drive)
  File "/usr/share/vdsm/virt/vm.py", line 5660, in _syncVolumeChain
    volumes = self._driveGetActualVolumeChain(drive)
  File "/usr/share/vdsm/virt/vm.py", line 5639, in _driveGetActualVolumeChain
    sourceAttr = ('file', 'dev')[drive.blockDev]
TypeError: tuple indices must be integers, not NoneType
Thread-13::DEBUG::2014-08-19 15:34:23,101::vm::2254::vm.Vm::(_startUnderlyingVm) vmId=`f8262f1c-71d3-4b0a-b14d-3bb060d89c61`::_ongoingCreations released

then looks like the same issue, all fixed by http://gerrit.ovirt.org/#/c/31047/
which will be on VDSM 4.16.2.

Moving to ON_QA

Comment 14 Vinzenz Feenstra [evilissimo] 2014-08-20 13:54:27 UTC


*** This bug has been marked as a duplicate of bug 1124970 ***