Bug 973081 - A non-SPM host which loses access to storage keeps oscillating between Non-Operational and Up state
Status: CLOSED WORKSFORME
Product: ovirt-engine
Classification: oVirt
Component: General
Hardware: Unspecified
OS: Unspecified
Severity: high
Target Milestone: ovirt-4.0.0-beta
Assigned To: Liron Aravot
QA Contact: Aharon Canan
Depends On:
Blocks:
Reported: 2013-06-11 04:55 EDT by Jiri Belka
Modified: 2016-05-08 04:14 EDT (History)
CC: 17 users

Doc Type: Bug Fix
Last Closed: 2016-05-08 04:14:42 EDT
Type: Bug
oVirt Team: Storage


Attachments:
engine.log and vdsm.log(s) (191.25 KB, application/x-tar), 2013-06-11 04:55 EDT, Jiri Belka
logs-3.6 (1.69 MB, application/x-gzip), 2015-11-02 04:45 EST, Elad

Description Jiri Belka 2013-06-11 04:55:42 EDT
Created attachment 759485 [details]
engine.log and vdsm.log(s)

Description of problem:
I had a VM running on a host and I used iptables on that host to block its access to the NFS storage. A migration of this VM was then started, but it never finishes. The host therefore stays up, and its status keeps oscillating between 'Non-Operational' and 'Up' the whole time. Power management was set up correctly and its test passed.

Version-Release number of selected component (if applicable):
sf17.5

How reproducible:
100%

Steps to Reproduce:
1. 2 hosts with power management set up and working
2. start a VM
3. on the host running the VM, block access to the data SD with iptables

Actual results:
The VM stays in the migration process for more than 30 minutes; the host status keeps switching between 'Up' and 'Non-Operational' the whole time.

Expected results:
If migration is chosen, it should be verified that the migration actually progresses; if it does not, some timeout should be enforced and the host should then be fenced anyway.

Additional info:

* source host:

[root@dell-r210ii-03 ~]# ps ax | grep '[q]emu-kvm'
 8133 ?        Sl     0:19 /usr/libexec/qemu-kvm -name jb-rh6-x64 -S -M rhel6.4.0 -cpu SandyBridge -enable-kvm -m 512 -smp 1,sockets=1,cores=1,threads=1 -uuid f7fe0970-34aa-40ff-8a46-d0ad16698a7c -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=6Server-6.4.0.4.el6,serial=4C4C4544-0058-3410-8058-C3C04F38354A,uuid=f7fe0970-34aa-40ff-8a46-d0ad16698a7c -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/jb-rh6-x64.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2013-06-11T10:05:19,driftfix=slew -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/5849b030-626e-47cb-ad90-3ce782d831b3/f15b179b-bbca-4d19-bc1b-fccb2e2fad95/images/3afcd443-c416-4897-b866-6e618df20302/194578cf-affa-47ff-afb7-01977b3d87b6,if=none,id=drive-virtio-disk0,format=qcow2,serial=3afcd443-c416-4897-b866-6e618df20302,cache=none,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=27,id=hostnet0,vhost=on,vhostfd=28 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:e0:3f:86,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/jb-rh6-x64.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/jb-rh6-x64.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device 
virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice port=5900,tls-port=5901,addr=0,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -k en-us -vga qxl -global qxl-vga.ram_size=67108864 -global qxl-vga.vram_size=67108864 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6

[root@dell-r210ii-03 ~]# 
Message from syslogd@dell-r210ii-03 at Jun 11 10:52:12 ...
 vdsm vm.Vm ERROR vmId=`f7fe0970-34aa-40ff-8a46-d0ad16698a7c`::Unable to update the volume 194578cf-affa-47ff-afb7-01977b3d87b6 (domain: f15b179b-bbca-4d19-bc1b-fccb2e2fad95 image: 3afcd443-c416-4897-b866-6e618df20302) for the drive vda

* destination host:

[root@dell-r210ii-04 ~]# ps ax | grep '[q]emu-kvm'
 7576 ?        Sl     0:00 /usr/libexec/qemu-kvm -name jb-rh6-x64 -S -M rhel6.4.0 -cpu SandyBridge -enable-kvm -m 512 -smp 1,sockets=1,cores=1,threads=1 -uuid f7fe0970-34aa-40ff-8a46-d0ad16698a7c -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=6Server-6.4.0.4.el6,serial=4C4C4544-0058-3410-8058-C3C04F38354A,uuid=f7fe0970-34aa-40ff-8a46-d0ad16698a7c -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/jb-rh6-x64.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2013-06-11T10:13:59,driftfix=slew -no-shutdown -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/5849b030-626e-47cb-ad90-3ce782d831b3/f15b179b-bbca-4d19-bc1b-fccb2e2fad95/images/3afcd443-c416-4897-b866-6e618df20302/194578cf-affa-47ff-afb7-01977b3d87b6,if=none,id=drive-virtio-disk0,format=qcow2,serial=3afcd443-c416-4897-b866-6e618df20302,cache=none,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=29,id=hostnet0,vhost=on,vhostfd=30 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:e0:3f:86,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/jb-rh6-x64.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/jb-rh6-x64.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device 
virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice port=5900,tls-port=5901,addr=0,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -k en-us -vga qxl -global qxl-vga.ram_size=67108864 -global qxl-vga.vram_size=67108864 -incoming tcp:[::]:49152 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
Comment 1 Jiri Belka 2013-06-11 10:57:01 EDT
Modified the summary per hateya@'s request, as fencing should not be done on non-SPM hosts. Thus the remaining issue is the never-ending migration.
Comment 2 Michal Skrivanek 2013-06-13 05:08:29 EDT
maybe related to recent fencing changes?
Comment 4 Michal Skrivanek 2014-04-01 04:25:55 EDT
Francesco, can you please check if this is related to the host status flapping on storage issues.
The migration will obviously never finish if the VM can't access storage, so only the flapping issue remains.
Comment 7 Francesco Romani 2014-04-08 10:53:32 EDT
taking the bug
Comment 8 Francesco Romani 2014-04-09 11:30:55 EDT
The symptoms look different with respect to https://bugzilla.redhat.com/show_bug.cgi?id=866484

So, will look to understand why this one has the flapping status.
Comment 9 Francesco Romani 2014-04-11 11:39:37 EDT
The reason why the host goes back and forth between 'NonOperational' and 'Up'
seems to be the normal autorecovery flow:

$ grep -i 'autorecovering host' engine.log 
2013-06-11 10:15:00,035 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-92) Autorecovering hosts id: 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, name : dell-r210ii-03
2013-06-11 10:20:00,034 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-3) Autorecovering hosts id: 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, name : dell-r210ii-03
2013-06-11 10:25:00,035 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-48) Autorecovering hosts id: 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, name : dell-r210ii-03
2013-06-11 10:30:00,034 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-24) Autorecovering hosts id: 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, name : dell-r210ii-03
2013-06-11 10:35:00,034 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-60) Autorecovering hosts id: 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, name : dell-r210ii-03
2013-06-11 10:40:00,035 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-31) Autorecovering hosts id: 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, name : dell-r210ii-03

This fails regularly because the storage domain is unreachable:
$ grep -R 'SetVdsStatusVDSCommand.*9f0aace8-e908-43e4-a6e4-39b1d6f538bb' engine.log 
2013-06-11 10:13:31,421 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (pool-4-thread-47) [21623f75] START, SetVdsStatusVDSCommand(HostName = dell-r210ii-03, HostId = 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 2ca6090c
2013-06-11 10:15:00,099 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-92) [2e7080fb] START, SetVdsStatusVDSCommand(HostName = dell-r210ii-03, HostId = 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, status=Unassigned, nonOperationalReason=NONE), log id: 27ab5203
2013-06-11 10:18:52,835 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-89) [4ad9b27] START, SetVdsStatusVDSCommand(HostName = dell-r210ii-03, HostId = 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 35f8f51e
2013-06-11 10:18:53,013 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (pool-4-thread-47) [ac9868a] START, SetVdsStatusVDSCommand(HostName = dell-r210ii-03, HostId = 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 1bbeb7f6
2013-06-11 10:20:00,100 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-3) [4dfee9cd] START, SetVdsStatusVDSCommand(HostName = dell-r210ii-03, HostId = 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, status=Unassigned, nonOperationalReason=NONE), log id: 39bb38e8
2013-06-11 10:22:52,847 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-2) [2ee85c8] START, SetVdsStatusVDSCommand(HostName = dell-r210ii-03, HostId = 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 5e8bb892
2013-06-11 10:25:00,103 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-48) [290d311] START, SetVdsStatusVDSCommand(HostName = dell-r210ii-03, HostId = 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, status=Unassigned, nonOperationalReason=NONE), log id: 5e5aed74
2013-06-11 10:27:04,716 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-56) [298825fc] START, SetVdsStatusVDSCommand(HostName = dell-r210ii-03, HostId = 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 3d1692
2013-06-11 10:30:00,098 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-24) [34aabe9d] START, SetVdsStatusVDSCommand(HostName = dell-r210ii-03, HostId = 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, status=Unassigned, nonOperationalReason=NONE), log id: 54d3f4f
2013-06-11 10:33:16,853 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-35) [6ac7970a] START, SetVdsStatusVDSCommand(HostName = dell-r210ii-03, HostId = 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 101ccea1
2013-06-11 10:35:00,100 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-60) [1c74defd] START, SetVdsStatusVDSCommand(HostName = dell-r210ii-03, HostId = 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, status=Unassigned, nonOperationalReason=NONE), log id: 4cb9a459
2013-06-11 10:37:18,965 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-59) [237f90c] START, SetVdsStatusVDSCommand(HostName = dell-r210ii-03, HostId = 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 55b58939
2013-06-11 10:40:00,098 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-31) [f7a20a0] START, SetVdsStatusVDSCommand(HostName = dell-r210ii-03, HostId = 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, status=Unassigned, nonOperationalReason=NONE), log id: 17c3083f
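The log excerpt above is the whole flapping mechanism: every 5 minutes the AutoRecoveryManager moves the NonOperational host to Unassigned and retries activation, which fails again because the storage domain is still blocked. A minimal Python sketch of that cycle (illustrative only, not engine code):

```python
def autorecovery_tick(status, storage_reachable):
    """One 5-minute autorecovery cycle for a single host.

    Returns the list of status transitions the cycle produces.
    Illustrative model of the engine behaviour seen in the log,
    not the actual AutoRecoveryManager code.
    """
    if status != "NonOperational":
        return [status]                      # nothing to recover
    transitions = ["Unassigned"]             # AutoRecoveryManager retries
    # activation re-checks the storage domain on the way back up
    transitions.append("Up" if storage_reachable else "NonOperational")
    return transitions


# With storage blocked, every cycle yields the same oscillation:
history = []
status = "NonOperational"
for _ in range(3):                           # three 5-minute cycles
    steps = autorecovery_tick(status, storage_reachable=False)
    history.extend(steps)
    status = steps[-1]
# history is ['Unassigned', 'NonOperational'] repeated three times
```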
Comment 10 Francesco Romani 2014-04-16 06:45:45 EDT
After deeper inspection, it seems strange that the autorecovery started while the domain was still unreachable.

one VDSM seems OK (host 4):

Thread-2102::DEBUG::2013-06-11 10:05:04,326::resourceManager::197::ResourceManager.Request::(__init__) ResName=`Storage.5849b030-626e-47cb-ad90-3ce782d831b3`ReqID=`acf1bd24-1e70-4e36-9bf1-5b0f69f60844`::Request was made in '/usr/share/vdsm/storage/hsm.py' line '2478' at 'getStoragePoolInfo'
Thread-2102::DEBUG::2013-06-11 10:05:04,326::resourceManager::541::ResourceManager::(registerResource) Trying to register resource 'Storage.5849b030-626e-47cb-ad90-3ce782d831b3' for lock type 'shared'
Thread-2102::DEBUG::2013-06-11 10:05:04,326::resourceManager::600::ResourceManager::(registerResource) Resource 'Storage.5849b030-626e-47cb-ad90-3ce782d831b3' is free. Now locking as 'shared' (1 active user)
Thread-2102::DEBUG::2013-06-11 10:05:04,326::resourceManager::237::ResourceManager.Request::(grant) ResName=`Storage.5849b030-626e-47cb-ad90-3ce782d831b3`ReqID=`acf1bd24-1e70-4e36-9bf1-5b0f69f60844`::Granted request
Thread-2102::DEBUG::2013-06-11 10:05:04,339::task::811::TaskManager.Task::(resourceAcquired) Task=`1c0ae7c4-67ea-451b-b3aa-20f730cb9e5c`::_resourcesAcquired: Storage.5849b030-626e-47cb-ad90-3ce782d831b3 (shared)
Thread-2102::DEBUG::2013-06-11 10:05:04,340::task::974::TaskManager.Task::(_decref) Task=`1c0ae7c4-67ea-451b-b3aa-20f730cb9e5c`::ref 1 aborting False
Thread-2102::INFO::2013-06-11 10:05:04,341::logUtils::47::dispatcher::(wrapper) Run and protect: getStoragePoolInfo, Return response: {'info': {'spm_id': 2, 'master_uuid': 'f15b179b-bbca-4d19-bc1b-fccb2e2fad95', 'name': 'Default', 'version': '3', 'domains': '0c78b4d6-ba00-4d3e-9f9f-65c7d5899d71:Active,f15b179b-bbca-4d19-bc1b-fccb2e2fad95:Active', 'pool_status': 'connected', 'isoprefix': '/rhev/data-center/mnt/10.34.63.204:_home_iso_shared/0c78b4d6-ba00-4d3e-9f9f-65c7d5899d71/images/11111111-1111-1111-1111-111111111111', 'type': 'NFS', 'master_ver': 1, 'lver': 2}, 'dominfo': {'0c78b4d6-ba00-4d3e-9f9f-65c7d5899d71': {'status': 'Active', 'diskfree': '74501849088', 'alerts': [], 'version': 0, 'disktotal': '471458643968'}, 'f15b179b-bbca-4d19-bc1b-fccb2e2fad95': {'status': 'Active', 'diskfree': '265512976384', 'alerts': [], 'version': 3, 'disktotal': '422943129600'}}}
Thread-2102::DEBUG::2013-06-11 10:05:04,341::task::1168::TaskManager.Task::(prepare) Task=`1c0ae7c4-67ea-451b-b3aa-20f730cb9e5c`::finished: {'info': {'spm_id': 2, 'master_uuid': 'f15b179b-bbca-4d19-bc1b-fccb2e2fad95', 'name': 'Default', 'version': '3', 'domains': '0c78b4d6-ba00-4d3e-9f9f-65c7d5899d71:Active,f15b179b-bbca-4d19-bc1b-fccb2e2fad95:Active', 'pool_status': 'connected', 'isoprefix': '/rhev/data-center/mnt/10.34.63.204:_home_iso_shared/0c78b4d6-ba00-4d3e-9f9f-65c7d5899d71/images/11111111-1111-1111-1111-111111111111', 'type': 'NFS', 'master_ver': 1, 'lver': 2}, 'dominfo': {'0c78b4d6-ba00-4d3e-9f9f-65c7d5899d71': {'status': 'Active', 'diskfree': '74501849088', 'alerts': [], 'version': 0, 'disktotal': '471458643968'}, 'f15b179b-bbca-4d19-bc1b-fccb2e2fad95': {'status': 'Active', 'diskfree': '265512976384', 'alerts': [], 'version': 3, 'disktotal': '422943129600'}}}
Thread-2102::DEBUG::2013-06-11 10:05:04,341::task::579::TaskManager.Task::(_updateState) Task=`1c0ae7c4-67ea-451b-b3aa-20f730cb9e5c`::moving from state preparing -> state finished


But the other seems to be failing in the domain monitor (host 3):
Thread-23::ERROR::2013-06-11 10:09:29,833::domainMonitor::225::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain f15b179b-bbca-4d19-bc1b-fccb2e2fad95 monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 201, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/nfsSD.py", line 108, in selftest
    fileSD.FileStorageDomain.selftest(self)
  File "/usr/share/vdsm/storage/fileSD.py", line 505, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 275, in callCrabRPCFunction
    *args, **kwargs)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 180, in callCrabRPCFunction
    rawLength = self._recvAll(LENGTH_STRUCT_LENGTH, timeout)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 146, in _recvAll
    raise Timeout()
Timeout

It seems to me the domain monitor itself is failing, and that could explain the constantly failing autorecovery. Moving to storage for further investigation on this lead.
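For context, the Timeout in the traceback above is a read deadline expiring: the remote file handler's helper process is blocked (here in statvfs() on the dead NFS mount), so the length-prefixed reply never arrives. A hedged sketch of that read pattern, with illustrative names rather than vdsm's actual remoteFileHandler implementation:

```python
import os
import select


class Timeout(Exception):
    """Raised when the helper process does not reply before the deadline."""


def recv_all(fd, length, timeout):
    """Read exactly `length` bytes from `fd`, or raise Timeout.

    Sketch of the _recvAll idea in the traceback: if the helper is stuck
    in uninterruptible I/O, no bytes arrive and the deadline fires.
    """
    data = b""
    while len(data) < length:
        ready, _, _ = select.select([fd], [], [], timeout)
        if not ready:
            raise Timeout()            # helper blocked; no reply coming
        chunk = os.read(fd, length - len(data))
        if not chunk:
            raise Timeout()            # helper died without replying
        data += chunk
    return data
```

In the healthy case the helper writes a length header promptly and the loop completes; when the mount hangs, Timeout propagates up to the domain monitor exactly as in the traceback.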
Comment 11 Federico Simoncelli 2014-04-17 12:01:40 EDT
The fact that the autorecovery tries every 5 minutes to reactivate the host is unrelated to the migration being stuck. The connection to the domain was blocked around 10:08:27:

Thread-199::INFO::2013-06-11 10:08:27,538::logUtils::42::dispatcher::(wrapper) Run and protect: repoStats, Return response: {'0c78b4d6-ba00-4d3e-9f9f-65c7d5899d71': {'delay': '0.00301504135132', 'lastCheck': '7.8', 'code': 0, 'valid': True, 'version': 0}, 'f15b179b-bbca-4d19-bc1b-fccb2e2fad95': {'delay': '0.00330400466919', 'lastCheck': '7.7', 'code': 0, 'valid': True, 'version': 3}}

Thread-205::INFO::2013-06-11 10:08:42,930::logUtils::42::dispatcher::(wrapper) Run and protect: repoStats, Return response: {'0c78b4d6-ba00-4d3e-9f9f-65c7d5899d71': {'delay': '0.00301480293274', 'lastCheck': '3.1', 'code': 0, 'valid': True, 'version': 0}, 'f15b179b-bbca-4d19-bc1b-fccb2e2fad95': {'delay': '0.00330400466919', 'lastCheck': '23.1', 'code': 0, 'valid': True, 'version': 3}}

Thread-211::INFO::2013-06-11 10:08:58,308::logUtils::42::dispatcher::(wrapper) Run and protect: repoStats, Return response: {'0c78b4d6-ba00-4d3e-9f9f-65c7d5899d71': {'delay': '0.00298094749451', 'lastCheck': '8.5', 'code': 0, 'valid': True, 'version': 0}, 'f15b179b-bbca-4d19-bc1b-fccb2e2fad95': {'delay': '0.00330400466919', 'lastCheck': '38.5', 'code': 0, 'valid': True, 'version': 3}}

Thread-217::INFO::2013-06-11 10:09:13,720::logUtils::42::dispatcher::(wrapper) Run and protect: repoStats, Return response: {'0c78b4d6-ba00-4d3e-9f9f-65c7d5899d71': {'delay': '0.00284099578857', 'lastCheck': '3.9', 'code': 0, 'valid': True, 'version': 0}, 'f15b179b-bbca-4d19-bc1b-fccb2e2fad95': {'delay': '0.00330400466919', 'lastCheck': '53.9', 'code': 0, 'valid': True, 'version': 3}}

Thread-223::INFO::2013-06-11 10:09:29,105::logUtils::42::dispatcher::(wrapper) Run and protect: repoStats, Return response: {'0c78b4d6-ba00-4d3e-9f9f-65c7d5899d71': {'delay': '0.00323295593262', 'lastCheck': '9.3', 'code': 0, 'valid': True, 'version': 0}, 'f15b179b-bbca-4d19-bc1b-fccb2e2fad95': {'delay': '0.00330400466919', 'lastCheck': '69.3', 'code': 0, 'valid': True, 'version': 3}}
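The tell in the repoStats samples above is the 'lastCheck' age for domain f15b179b growing by about 15 seconds per sample (7.7, 23.1, 38.5, 53.9, 69.3) while 'valid' is still True: the monitor thread has stopped completing checks even though the last completed check succeeded. A sketch of flagging that staleness (the threshold is an assumed illustrative value, not vdsm's configured one):

```python
STALE_AFTER = 30.0  # seconds without a completed check; illustrative value


def stale_domains(repo_stats, threshold=STALE_AFTER):
    """Return IDs of domains whose monitor hasn't completed a check recently.

    Note the check is on the age of the last completed check, not on
    'valid'/'code', which still report the last successful result.
    """
    return sorted(
        sd_id
        for sd_id, stats in repo_stats.items()
        if float(stats["lastCheck"]) > threshold
    )


# The final sample above, reduced to the relevant fields:
sample = {
    "0c78b4d6-ba00-4d3e-9f9f-65c7d5899d71": {"lastCheck": "9.3", "code": 0, "valid": True},
    "f15b179b-bbca-4d19-bc1b-fccb2e2fad95": {"lastCheck": "69.3", "code": 0, "valid": True},
}
# stale_domains(sample) flags only the blocked domain
```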

Thread-23::ERROR::2013-06-11 10:09:29,833::domainMonitor::225::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain f15b179b-bbca-4d19-bc1b-fccb2e2fad95 monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 201, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/nfsSD.py", line 108, in selftest
    fileSD.FileStorageDomain.selftest(self)
  File "/usr/share/vdsm/storage/fileSD.py", line 505, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 275, in callCrabRPCFunction
    *args, **kwargs)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 180, in callCrabRPCFunction
    rawLength = self._recvAll(LENGTH_STRUCT_LENGTH, timeout)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 146, in _recvAll
    raise Timeout()
Timeout

The migration request arrived at 10:13:59,035 and never progressed over 90%:

Thread-335::DEBUG::2013-06-11 10:13:59,035::BindingXMLRPC::913::vds::(wrapper) client [10.34.63.76]::call vmMigrate with ({'src': '10.34.63.222', 'dst': '10.34.63.223:54321', 'vmId': 'f7fe0970-34aa-40ff-8a46-d0ad16698a7c', 'method': 'online'},) {} flowID [21623f75]

Thread-338::INFO::2013-06-11 10:14:09,199::libvirtvm::427::vm.Vm::(run) vmId=`f7fe0970-34aa-40ff-8a46-d0ad16698a7c`::Migration Progress: 9 seconds elapsed, 71% of data processed, 71% of mem processed
...
Thread-338::INFO::2013-06-11 10:14:19,200::libvirtvm::427::vm.Vm::(run) vmId=`f7fe0970-34aa-40ff-8a46-d0ad16698a7c`::Migration Progress: 20 seconds elapsed, 90% of data processed, 90% of mem processed
...
Thread-338::INFO::2013-06-11 10:15:09,204::libvirtvm::427::vm.Vm::(run) vmId=`f7fe0970-34aa-40ff-8a46-d0ad16698a7c`::Migration Progress: 70 seconds elapsed, 90% of data processed, 90% of mem processed
...
Thread-338::WARNING::2013-06-11 10:19:19,235::libvirtvm::405::vm.Vm::(run) vmId=`f7fe0970-34aa-40ff-8a46-d0ad16698a7c`::Migration is stuck: Hasn't progressed in 300.034855127 seconds. Aborting.
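The "Migration is stuck ... Aborting" warning reflects a progress watchdog: progress is sampled periodically and the migration is aborted once it has not advanced for the configured interval (300 seconds here). A simplified sketch of the idea, not vdsm's actual monitor thread:

```python
import time


class MigrationStallMonitor:
    """Abort a migration whose progress hasn't changed for stall_timeout seconds."""

    def __init__(self, stall_timeout=300.0, clock=time.monotonic):
        self._timeout = stall_timeout
        self._clock = clock            # injectable for testing
        self._last_progress = None
        self._last_change = None

    def update(self, percent_done):
        """Feed a progress sample; return True if the migration should abort."""
        now = self._clock()
        if percent_done != self._last_progress:
            self._last_progress = percent_done
            self._last_change = now    # progress moved: reset the stall clock
            return False
        return (now - self._last_change) >= self._timeout
```

Fed the samples from the log (71% at 9s, then 90% from 20s onward), the monitor stays quiet until 90% has been unchanged for 300 seconds and then signals the abort.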

After that we started seeing the usual "cannot acquire state change lock" errors:

Thread-105::DEBUG::2013-06-11 10:15:39,001::libvirtconnection::128::vds::(wrapper) Unknown libvirterror: ecode: 68 edom: 10 level: 2 message: Timed out during operation: cannot acquire state change lock
Thread-105::ERROR::2013-06-11 10:15:39,001::utils::416::vm.Vm::(collect) vmId=`f7fe0970-34aa-40ff-8a46-d0ad16698a7c`::Stats function failed: <AdvancedStatsFunction _sampleDisk at 0x2911c38>
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 412, in collect
    statsFunction()
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 287, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/libvirtvm.py", line 164, in _sampleDisk
    diskSamples[vmDrive.name] = self._vm._dom.blockStats(vmDrive.name)
  File "/usr/share/vdsm/libvirtvm.py", line 539, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock

I don't know if the migration really never progressed over 90% or if it was just libvirt not being available to be queried (cannot acquire state change lock).
Comment 12 Francesco Romani 2014-04-18 09:44:38 EDT
retaking the bug to address the migration not completed.
Comment 13 Francesco Romani 2014-05-08 10:47:19 EDT
Libvirt produces "cannot acquire state change lock" when it cannot start a QEMU job, because there is one currently active and nesting is not allowed, or because the nesting limit has been reached. However, in the latter case, the log message will be slightly different.

Given the sequence of events, the fact that the NFS domain was blocked, and the error reported, the most likely cause is the QEMU process stuck in D state on I/O. In turn, a likely trigger is the virDomainGetBlockJobInfo call VDSM makes for high-water-mark sampling, because the disk is configured as follows

		<disk device="disk" snapshot="no" type="file">
			<address  domain="0x0000"  function="0x0"  slot="0x05"  type="pci" bus="0x00"/>
			<source file="/rhev/data-center/5849b030-626e-47cb-ad90-3ce782d831b3/f15b179b-bbca-4d19-bc1b-fccb2e2fad95/images/3afcd443-c416-4897-b866-6e618df20302/194578cf-affa-47ff-afb7-01977b3d87b6"/>
			<target bus="virtio" dev="vda"/>
			<serial>3afcd443-c416-4897-b866-6e618df20302</serial>
			<boot order="1"/>
			<driver cache="none" error_policy="stop" io="threads" name="qemu" type="qcow2"/>
		</disk>

(type="qcow2")

And this is enough to explain the failed migration.
Comment 14 Michal Skrivanek 2014-05-13 06:37:27 EDT
migration aside, I think the question still remains: why does autorecovery try to activate the host when the connection is still severed? I thought this was not expected
Comment 15 Federico Simoncelli 2014-05-13 18:14:32 EDT
(In reply to Michal Skrivanek from comment #14)
> migration aside, I think the question still remains - why does the
> autorecovery tries to activate the host when the connection is still
> severed? I thought this is not expected

How is that relevant to the live migration? Even if we don't try to autorecover every 5 minutes the libvirt process is still stuck on "cannot acquire state change lock" and migration won't finish.
Comment 16 Michal Skrivanek 2014-05-20 09:44:51 EDT
(In reply to Federico Simoncelli from comment #15)
It is not relevant to migration but that's not the point of the bug. It's about the host state. See in original description "its status is changing from 'Non-Operational' to 'Up' and back all the time"
Comment 17 Federico Simoncelli 2014-05-20 18:16:50 EDT
(In reply to Michal Skrivanek from comment #16)
> (In reply to Federico Simoncelli from comment #15)
> It is not relevant to migration but that's not the point of the bug. It's
> about the host state. See in original description "its status is changing
> from 'Non-Operational' to 'Up' and back all the time"

The original description starts with:

"I had a VM running on a host and I disabled access to storage nfs with iptables on this host. Then migration of this VM has been started but it does never finish."

Now, I don't see how the host status changing from 'Non-Operational' to 'Up' (and back) is affecting the situation. You are not able to manage the host (e.g. move it to maintenance, etc.) because of the running VM not because of the status.

We can spin off another bug and just have the status being in 'Non-Operational' but then I don't see any gain, you still can't migrate the VM, you can't manage the host, etc.
Comment 18 Michal Skrivanek 2014-05-21 02:44:10 EDT
it's not affecting the migration indeed, my whole point since comment #10 is the flapping host status. I'll change the title to make it clear and let's call it a new bug then:)

The code may have changed since the bug is almost a year old, but I'd still suggest a review of the domain autorecovery.
Comment 19 Federico Simoncelli 2014-05-21 04:08:33 EDT
(In reply to Michal Skrivanek from comment #18)
> it's not affecting the migration indeed, my whole point since comment #10 is
> the flapping host status. I'll change the title to make it clear and let's
> call it a new bug then:)

I think you missed the point. Now we have a new irrelevant bug and we lost track of an important one. How do you think to track the migration issue?
Comment 20 Michal Skrivanek 2014-05-21 06:50:44 EDT
there is no migration issue here. The fact that it didn't finish is expected. The fact that it took "forever" has been addressed by patches during 3.3.z and 3.4 (e.g. bug 1028917).

The only thing I want to keep track of is this "other" issue with host status.
Comment 23 Omer Frenkel 2015-07-07 04:17:24 EDT
according to comment 20 the migration issue is tracked (and solved) in a different bug, so this bug should only cover the host state issue.
moving to storage to have a look, there is nothing else related to virt.
Comment 24 Allon Mureinik 2015-07-07 05:20:39 EDT
I wonder if this still happens in 3.6.0.
Aharon - do you guys have such a test scenario (block NFS storage from non-SPM host)?
Comment 25 Aharon Canan 2015-07-07 11:15:31 EDT
We have such a case but I'm not sure it is on an HSM host.
We can retest it.
Comment 26 Yaniv Lavi 2015-10-22 04:16:01 EDT
Can you please retest?
Comment 27 Aharon Canan 2015-11-01 10:37:21 EST
Elad, Please do.
Comment 28 Elad 2015-11-02 04:39:08 EST
Tested using:
rhevm-3.6.0.2-0.1.el6.noarch
vdsm-4.17.8-1.el7ev.noarch

- SPM host configured with power management; the second host without.
- Started a VM on the non-SPM host with an attached disk located on a non-master (NFS) domain. Initiated OS installation so the VM would write to the disk.
- Blocked connectivity between the non-SPM host (where the qemu process is running) and the storage domain where the VM disk is located.


2015-11-02 09:06:33,380 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-46) [] Storage domain '78a4a497-4dd7-4cf5-b7a8-264a8d467f05:exportDomain' is not visible to one or more hosts. Since the domain's type is 'ImportExport', hosts status will not be changed to non-operational


2015-11-02 09:06:34,558 INFO  [org.ovirt.engine.core.vdsbroker.MigrateVDSCommand] (org.ovirt.thread.pool-7-thread-6) [5e622bee] START, MigrateVDSCommand( MigrateVDSCommandParameters:{runAsync='true', hostId='db6cd47a-68cb-4946-9622-02a073bb1ded', vmId='f8114073-a5ae-4df5-b263-837880160aa7', srcHost='aqua-vds4.qa.lab.tlv.redhat.com', dstVdsId='98f79a0b-d240-4ae5-9a89-4b4f76b7e5cb', dstHost='aqua-vds5.qa.lab.tlv.redhat.com:54321', migrationMethod='ONLINE', tunnelMigration='false', migrationDowntime='0', autoConverge='false', migrateCompressed='false', consoleAddress='null'}), log id: 6bcc93e0



Results:
- Host is flapping between 'connecting' and 'non-operational' (as expected).
- VM migration fails.


Logs from the engine and the hosts will be attached.
Comment 29 Elad 2015-11-02 04:45 EST
Created attachment 1088501 [details]
logs-3.6
Comment 30 Red Hat Bugzilla Rules Engine 2015-11-30 13:59:13 EST
Target release should be set once a package build is known to fix an issue. Since this bug is not in MODIFIED, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.
Comment 31 Allon Mureinik 2016-04-11 08:14:49 EDT
(In reply to Elad from comment #28)
> Results:
> - Host is flapping between 'connecting' and 'non-operational' (as should be).
> - VM migration fails.

So, what's the action item here?
Comment 32 Sandro Bonazzola 2016-05-02 06:05:28 EDT
Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has already been released and the bug is not ON_QA.
Comment 33 Liron Aravot 2016-05-08 04:14:42 EDT
As Elad's test results were as expected, it seems there is no issue here; closing.
Please reopen if relevant.
