Created attachment 759485 [details]
engine.log and vdsm.log(s)

Description of problem:
I had a VM running on a host and I disabled this host's access to the NFS storage with iptables. Migration of the VM was then started, but it never finishes. The host stays up and its status keeps flapping between 'Non-Operational' and 'Up'. Power management was set up correctly and its test passed.

Version-Release number of selected component (if applicable):
sf17.5

How reproducible:
100%

Steps to Reproduce:
1. 2 hosts with power management set up and working
2. start a VM
3. on the host running the VM, block access to the data SD with iptables

Actual results:
The VM stays in the migration process for more than 30 minutes; the host status keeps switching between 'Up' and 'Non-Operational'.

Expected results:
If migration is chosen, it should be verified that it actually progresses; if not, a timeout should be enforced and the host fenced anyway.

Additional info:
* source host:
[root@dell-r210ii-03 ~]# ps ax | grep '[q]emu-kvm'
8133 ?
Sl 0:19 /usr/libexec/qemu-kvm -name jb-rh6-x64 -S -M rhel6.4.0 -cpu SandyBridge -enable-kvm -m 512 -smp 1,sockets=1,cores=1,threads=1 -uuid f7fe0970-34aa-40ff-8a46-d0ad16698a7c -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=6Server-6.4.0.4.el6,serial=4C4C4544-0058-3410-8058-C3C04F38354A,uuid=f7fe0970-34aa-40ff-8a46-d0ad16698a7c -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/jb-rh6-x64.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2013-06-11T10:05:19,driftfix=slew -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/5849b030-626e-47cb-ad90-3ce782d831b3/f15b179b-bbca-4d19-bc1b-fccb2e2fad95/images/3afcd443-c416-4897-b866-6e618df20302/194578cf-affa-47ff-afb7-01977b3d87b6,if=none,id=drive-virtio-disk0,format=qcow2,serial=3afcd443-c416-4897-b866-6e618df20302,cache=none,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=27,id=hostnet0,vhost=on,vhostfd=28 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:e0:3f:86,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/jb-rh6-x64.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/jb-rh6-x64.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 
-spice port=5900,tls-port=5901,addr=0,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -k en-us -vga qxl -global qxl-vga.ram_size=67108864 -global qxl-vga.vram_size=67108864 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
[root@dell-r210ii-03 ~]#

Message from syslogd@dell-r210ii-03 at Jun 11 10:52:12 ...
<11>vdsm vm.Vm ERROR vmId=`f7fe0970-34aa-40ff-8a46-d0ad16698a7c`::Unable to update the volume 194578cf-affa-47ff-afb7-01977b3d87b6 (domain: f15b179b-bbca-4d19-bc1b-fccb2e2fad95 image: 3afcd443-c416-4897-b866-6e618df20302) for the drive vda

* destination host:
[root@dell-r210ii-04 ~]# ps ax | grep '[q]emu-kvm'
7576 ? Sl 0:00 /usr/libexec/qemu-kvm -name jb-rh6-x64 -S -M rhel6.4.0 -cpu SandyBridge -enable-kvm -m 512 -smp 1,sockets=1,cores=1,threads=1 -uuid f7fe0970-34aa-40ff-8a46-d0ad16698a7c -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=6Server-6.4.0.4.el6,serial=4C4C4544-0058-3410-8058-C3C04F38354A,uuid=f7fe0970-34aa-40ff-8a46-d0ad16698a7c -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/jb-rh6-x64.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2013-06-11T10:13:59,driftfix=slew -no-shutdown -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/5849b030-626e-47cb-ad90-3ce782d831b3/f15b179b-bbca-4d19-bc1b-fccb2e2fad95/images/3afcd443-c416-4897-b866-6e618df20302/194578cf-affa-47ff-afb7-01977b3d87b6,if=none,id=drive-virtio-disk0,format=qcow2,serial=3afcd443-c416-4897-b866-6e618df20302,cache=none,werror=stop,rerror=stop,aio=threads -device
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=29,id=hostnet0,vhost=on,vhostfd=30 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:e0:3f:86,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/jb-rh6-x64.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/jb-rh6-x64.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice port=5900,tls-port=5901,addr=0,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -k en-us -vga qxl -global qxl-vga.ram_size=67108864 -global qxl-vga.vram_size=67108864 -incoming tcp:[::]:49152 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
Modified the summary per hateya@'s request, as fencing should not be done on non-SPM hosts. Thus the remaining issue is the never-ending migration.
maybe related to recent fencing changes?
Francesco, can you please check whether this is related to the host status flapping on the storage issue? The migration will obviously never finish if the VM can't access storage, so only the flapping issue remains.
taking the bug
The symptoms look different from https://bugzilla.redhat.com/show_bug.cgi?id=866484, so I will look into why this one has the flapping status.
The reason why the host goes back and forth between 'NonOperational' and 'Up' seems to be the normal autorecovery flow:

$ grep -i 'autorecovering host' engine.log
2013-06-11 10:15:00,035 INFO [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-92) Autorecovering hosts id: 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, name : dell-r210ii-03
2013-06-11 10:20:00,034 INFO [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-3) Autorecovering hosts id: 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, name : dell-r210ii-03
2013-06-11 10:25:00,035 INFO [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-48) Autorecovering hosts id: 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, name : dell-r210ii-03
2013-06-11 10:30:00,034 INFO [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-24) Autorecovering hosts id: 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, name : dell-r210ii-03
2013-06-11 10:35:00,034 INFO [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-60) Autorecovering hosts id: 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, name : dell-r210ii-03
2013-06-11 10:40:00,035 INFO [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-31) Autorecovering hosts id: 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, name : dell-r210ii-03

which fails regularly because the storage domain is unreachable:

$ grep -R 'SetVdsStatusVDSCommand.*9f0aace8-e908-43e4-a6e4-39b1d6f538bb' engine.log
2013-06-11 10:13:31,421 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (pool-4-thread-47) [21623f75] START, SetVdsStatusVDSCommand(HostName = dell-r210ii-03, HostId = 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 2ca6090c
2013-06-11 10:15:00,099 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-92) [2e7080fb] START, SetVdsStatusVDSCommand(HostName = dell-r210ii-03, HostId = 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, status=Unassigned, nonOperationalReason=NONE), log id: 27ab5203
2013-06-11 10:18:52,835 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-89) [4ad9b27] START, SetVdsStatusVDSCommand(HostName = dell-r210ii-03, HostId = 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 35f8f51e
2013-06-11 10:18:53,013 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (pool-4-thread-47) [ac9868a] START, SetVdsStatusVDSCommand(HostName = dell-r210ii-03, HostId = 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 1bbeb7f6
2013-06-11 10:20:00,100 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-3) [4dfee9cd] START, SetVdsStatusVDSCommand(HostName = dell-r210ii-03, HostId = 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, status=Unassigned, nonOperationalReason=NONE), log id: 39bb38e8
2013-06-11 10:22:52,847 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-2) [2ee85c8] START, SetVdsStatusVDSCommand(HostName = dell-r210ii-03, HostId = 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 5e8bb892
2013-06-11 10:25:00,103 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-48) [290d311] START, SetVdsStatusVDSCommand(HostName = dell-r210ii-03, HostId = 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, status=Unassigned, nonOperationalReason=NONE), log id: 5e5aed74
2013-06-11 10:27:04,716 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-56) [298825fc] START, SetVdsStatusVDSCommand(HostName = dell-r210ii-03, HostId = 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 3d1692
2013-06-11 10:30:00,098 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-24) [34aabe9d] START, SetVdsStatusVDSCommand(HostName = dell-r210ii-03, HostId = 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, status=Unassigned, nonOperationalReason=NONE), log id: 54d3f4f
2013-06-11 10:33:16,853 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-35) [6ac7970a] START, SetVdsStatusVDSCommand(HostName = dell-r210ii-03, HostId = 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 101ccea1
2013-06-11 10:35:00,100 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-60) [1c74defd] START, SetVdsStatusVDSCommand(HostName = dell-r210ii-03, HostId = 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, status=Unassigned, nonOperationalReason=NONE), log id: 4cb9a459
2013-06-11 10:37:18,965 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-59) [237f90c] START, SetVdsStatusVDSCommand(HostName = dell-r210ii-03, HostId = 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 55b58939
2013-06-11 10:40:00,098 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-31) [f7a20a0] START, SetVdsStatusVDSCommand(HostName = dell-r210ii-03, HostId = 9f0aace8-e908-43e4-a6e4-39b1d6f538bb, status=Unassigned, nonOperationalReason=NONE), log id: 17c3083f
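The cycle visible in these logs can be sketched as follows. This is an illustrative model of the autorecovery loop, not actual engine code, and the names are hypothetical; it only shows why retrying activation while storage is still blocked produces the flapping status:

```python
# Illustrative sketch (not engine code): the AutoRecoveryManager periodically
# moves a NonOperational host back to Unassigned and re-probes it. If the
# storage domain is still unreachable, the probe fails and the host drops
# back to NonOperational -- the Up/NonOperational flapping seen above.

def autorecover_once(host, storage_reachable):
    """One 5-minute autorecovery cycle: optimistically reactivate, re-check."""
    host["status"] = "Unassigned"          # engine resets the status
    if storage_reachable():
        host["status"] = "Up"
    else:
        host["status"] = "NonOperational"  # STORAGE_DOMAIN_UNREACHABLE
    return host["status"]

host = {"name": "dell-r210ii-03", "status": "NonOperational"}
# Storage stays blocked, so every cycle ends where it started:
statuses = [autorecover_once(host, lambda: False) for _ in range(3)]
```

Every cycle re-triggers the Unassigned/NonOperational pair of SetVdsStatusVDSCommand calls, which the engine surfaces as the host flapping.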
After deeper inspection, it seems strange that the autorecovery started while the domain was still unreachable. One VDSM seems OK (host 4):

Thread-2102::DEBUG::2013-06-11 10:05:04,326::resourceManager::197::ResourceManager.Request::(__init__) ResName=`Storage.5849b030-626e-47cb-ad90-3ce782d831b3`ReqID=`acf1bd24-1e70-4e36-9bf1-5b0f69f60844`::Request was made in '/usr/share/vdsm/storage/hsm.py' line '2478' at 'getStoragePoolInfo'
Thread-2102::DEBUG::2013-06-11 10:05:04,326::resourceManager::541::ResourceManager::(registerResource) Trying to register resource 'Storage.5849b030-626e-47cb-ad90-3ce782d831b3' for lock type 'shared'
Thread-2102::DEBUG::2013-06-11 10:05:04,326::resourceManager::600::ResourceManager::(registerResource) Resource 'Storage.5849b030-626e-47cb-ad90-3ce782d831b3' is free. Now locking as 'shared' (1 active user)
Thread-2102::DEBUG::2013-06-11 10:05:04,326::resourceManager::237::ResourceManager.Request::(grant) ResName=`Storage.5849b030-626e-47cb-ad90-3ce782d831b3`ReqID=`acf1bd24-1e70-4e36-9bf1-5b0f69f60844`::Granted request
Thread-2102::DEBUG::2013-06-11 10:05:04,339::task::811::TaskManager.Task::(resourceAcquired) Task=`1c0ae7c4-67ea-451b-b3aa-20f730cb9e5c`::_resourcesAcquired: Storage.5849b030-626e-47cb-ad90-3ce782d831b3 (shared)
Thread-2102::DEBUG::2013-06-11 10:05:04,340::task::974::TaskManager.Task::(_decref) Task=`1c0ae7c4-67ea-451b-b3aa-20f730cb9e5c`::ref 1 aborting False
Thread-2102::INFO::2013-06-11 10:05:04,341::logUtils::47::dispatcher::(wrapper) Run and protect: getStoragePoolInfo, Return response: {'info': {'spm_id': 2, 'master_uuid': 'f15b179b-bbca-4d19-bc1b-fccb2e2fad95', 'name': 'Default', 'version': '3', 'domains': '0c78b4d6-ba00-4d3e-9f9f-65c7d5899d71:Active,f15b179b-bbca-4d19-bc1b-fccb2e2fad95:Active', 'pool_status': 'connected', 'isoprefix': '/rhev/data-center/mnt/10.34.63.204:_home_iso_shared/0c78b4d6-ba00-4d3e-9f9f-65c7d5899d71/images/11111111-1111-1111-1111-111111111111', 'type': 'NFS', 'master_ver': 1, 'lver': 2}, 'dominfo': {'0c78b4d6-ba00-4d3e-9f9f-65c7d5899d71': {'status': 'Active', 'diskfree': '74501849088', 'alerts': [], 'version': 0, 'disktotal': '471458643968'}, 'f15b179b-bbca-4d19-bc1b-fccb2e2fad95': {'status': 'Active', 'diskfree': '265512976384', 'alerts': [], 'version': 3, 'disktotal': '422943129600'}}}
Thread-2102::DEBUG::2013-06-11 10:05:04,341::task::1168::TaskManager.Task::(prepare) Task=`1c0ae7c4-67ea-451b-b3aa-20f730cb9e5c`::finished: {'info': {'spm_id': 2, 'master_uuid': 'f15b179b-bbca-4d19-bc1b-fccb2e2fad95', 'name': 'Default', 'version': '3', 'domains': '0c78b4d6-ba00-4d3e-9f9f-65c7d5899d71:Active,f15b179b-bbca-4d19-bc1b-fccb2e2fad95:Active', 'pool_status': 'connected', 'isoprefix': '/rhev/data-center/mnt/10.34.63.204:_home_iso_shared/0c78b4d6-ba00-4d3e-9f9f-65c7d5899d71/images/11111111-1111-1111-1111-111111111111', 'type': 'NFS', 'master_ver': 1, 'lver': 2}, 'dominfo': {'0c78b4d6-ba00-4d3e-9f9f-65c7d5899d71': {'status': 'Active', 'diskfree': '74501849088', 'alerts': [], 'version': 0, 'disktotal': '471458643968'}, 'f15b179b-bbca-4d19-bc1b-fccb2e2fad95': {'status': 'Active', 'diskfree': '265512976384', 'alerts': [], 'version': 3, 'disktotal': '422943129600'}}}
Thread-2102::DEBUG::2013-06-11 10:05:04,341::task::579::TaskManager.Task::(_updateState) Task=`1c0ae7c4-67ea-451b-b3aa-20f730cb9e5c`::moving from state preparing -> state finished

But the other seems to be failing in the domain monitor (host 3):

Thread-23::ERROR::2013-06-11 10:09:29,833::domainMonitor::225::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain f15b179b-bbca-4d19-bc1b-fccb2e2fad95 monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 201, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/nfsSD.py", line 108, in selftest
    fileSD.FileStorageDomain.selftest(self)
  File "/usr/share/vdsm/storage/fileSD.py", line 505, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 275, in callCrabRPCFunction
    *args, **kwargs)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 180, in callCrabRPCFunction
    rawLength = self._recvAll(LENGTH_STRUCT_LENGTH, timeout)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 146, in _recvAll
    raise Timeout()
Timeout

It seems to me the domain monitor itself is failing, and that could explain the constantly failing autorecovery. Moving to storage for further investigation on this lead.
The fact that the autorecovery tries every 5 minutes to reactivate the host is unrelated to the migration being stuck. The connection to the domain was blocked around 10:08:27:

Thread-199::INFO::2013-06-11 10:08:27,538::logUtils::42::dispatcher::(wrapper) Run and protect: repoStats, Return response: {'0c78b4d6-ba00-4d3e-9f9f-65c7d5899d71': {'delay': '0.00301504135132', 'lastCheck': '7.8', 'code': 0, 'valid': True, 'version': 0}, 'f15b179b-bbca-4d19-bc1b-fccb2e2fad95': {'delay': '0.00330400466919', 'lastCheck': '7.7', 'code': 0, 'valid': True, 'version': 3}}
Thread-205::INFO::2013-06-11 10:08:42,930::logUtils::42::dispatcher::(wrapper) Run and protect: repoStats, Return response: {'0c78b4d6-ba00-4d3e-9f9f-65c7d5899d71': {'delay': '0.00301480293274', 'lastCheck': '3.1', 'code': 0, 'valid': True, 'version': 0}, 'f15b179b-bbca-4d19-bc1b-fccb2e2fad95': {'delay': '0.00330400466919', 'lastCheck': '23.1', 'code': 0, 'valid': True, 'version': 3}}
Thread-211::INFO::2013-06-11 10:08:58,308::logUtils::42::dispatcher::(wrapper) Run and protect: repoStats, Return response: {'0c78b4d6-ba00-4d3e-9f9f-65c7d5899d71': {'delay': '0.00298094749451', 'lastCheck': '8.5', 'code': 0, 'valid': True, 'version': 0}, 'f15b179b-bbca-4d19-bc1b-fccb2e2fad95': {'delay': '0.00330400466919', 'lastCheck': '38.5', 'code': 0, 'valid': True, 'version': 3}}
Thread-217::INFO::2013-06-11 10:09:13,720::logUtils::42::dispatcher::(wrapper) Run and protect: repoStats, Return response: {'0c78b4d6-ba00-4d3e-9f9f-65c7d5899d71': {'delay': '0.00284099578857', 'lastCheck': '3.9', 'code': 0, 'valid': True, 'version': 0}, 'f15b179b-bbca-4d19-bc1b-fccb2e2fad95': {'delay': '0.00330400466919', 'lastCheck': '53.9', 'code': 0, 'valid': True, 'version': 3}}
Thread-223::INFO::2013-06-11 10:09:29,105::logUtils::42::dispatcher::(wrapper) Run and protect: repoStats, Return response: {'0c78b4d6-ba00-4d3e-9f9f-65c7d5899d71': {'delay': '0.00323295593262', 'lastCheck': '9.3', 'code': 0, 'valid': True, 'version': 0}, 'f15b179b-bbca-4d19-bc1b-fccb2e2fad95': {'delay': '0.00330400466919', 'lastCheck': '69.3', 'code': 0, 'valid': True, 'version': 3}}
Thread-23::ERROR::2013-06-11 10:09:29,833::domainMonitor::225::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain f15b179b-bbca-4d19-bc1b-fccb2e2fad95 monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 201, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/nfsSD.py", line 108, in selftest
    fileSD.FileStorageDomain.selftest(self)
  File "/usr/share/vdsm/storage/fileSD.py", line 505, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 275, in callCrabRPCFunction
    *args, **kwargs)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 180, in callCrabRPCFunction
    rawLength = self._recvAll(LENGTH_STRUCT_LENGTH, timeout)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 146, in _recvAll
    raise Timeout()
Timeout

The migration request arrived at 10:13:59,035 and never progressed past 90%:

Thread-335::DEBUG::2013-06-11 10:13:59,035::BindingXMLRPC::913::vds::(wrapper) client [10.34.63.76]::call vmMigrate with ({'src': '10.34.63.222', 'dst': '10.34.63.223:54321', 'vmId': 'f7fe0970-34aa-40ff-8a46-d0ad16698a7c', 'method': 'online'},) {} flowID [21623f75]
Thread-338::INFO::2013-06-11 10:14:09,199::libvirtvm::427::vm.Vm::(run) vmId=`f7fe0970-34aa-40ff-8a46-d0ad16698a7c`::Migration Progress: 9 seconds elapsed, 71% of data processed, 71% of mem processed
...
Thread-338::INFO::2013-06-11 10:14:19,200::libvirtvm::427::vm.Vm::(run) vmId=`f7fe0970-34aa-40ff-8a46-d0ad16698a7c`::Migration Progress: 20 seconds elapsed, 90% of data processed, 90% of mem processed
...
Thread-338::INFO::2013-06-11 10:15:09,204::libvirtvm::427::vm.Vm::(run) vmId=`f7fe0970-34aa-40ff-8a46-d0ad16698a7c`::Migration Progress: 70 seconds elapsed, 90% of data processed, 90% of mem processed
...
Thread-338::WARNING::2013-06-11 10:19:19,235::libvirtvm::405::vm.Vm::(run) vmId=`f7fe0970-34aa-40ff-8a46-d0ad16698a7c`::Migration is stuck: Hasn't progressed in 300.034855127 seconds. Aborting.

After that we started seeing the usual "cannot acquire state change lock" errors:

Thread-105::DEBUG::2013-06-11 10:15:39,001::libvirtconnection::128::vds::(wrapper) Unknown libvirterror: ecode: 68 edom: 10 level: 2 message: Timed out during operation: cannot acquire state change lock
Thread-105::ERROR::2013-06-11 10:15:39,001::utils::416::vm.Vm::(collect) vmId=`f7fe0970-34aa-40ff-8a46-d0ad16698a7c`::Stats function failed: <AdvancedStatsFunction _sampleDisk at 0x2911c38>
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 412, in collect
    statsFunction()
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 287, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/libvirtvm.py", line 164, in _sampleDisk
    diskSamples[vmDrive.name] = self._vm._dom.blockStats(vmDrive.name)
  File "/usr/share/vdsm/libvirtvm.py", line 539, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock

I don't know whether the migration really never progressed past 90%, or whether libvirt was just unavailable for queries ("cannot acquire state change lock").
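The stall detection visible in the WARNING above can be sketched as follows. This is illustrative only; the 300-second limit mirrors the log, and the function name is hypothetical:

```python
def watch_migration(samples, stall_limit=300):
    """Sketch of the stall detector seen in the log: vdsm samples migration
    progress periodically and aborts once the 'data processed' figure has
    not moved for `stall_limit` seconds. `samples` is a list of
    (elapsed_seconds, percent_done) pairs."""
    last_progress, last_change = None, 0
    for elapsed, percent in samples:
        if percent != last_progress:
            last_progress, last_change = percent, elapsed
        elif elapsed - last_change >= stall_limit:
            return "abort: stuck for %d seconds at %d%%" % (
                elapsed - last_change, percent)
    return "running"

# Reproduces the timeline from the log: 71% -> 90%, then frozen at 90%.
samples = [(9, 71), (20, 90), (70, 90), (320, 90)]
result = watch_migration(samples)
```

With the disk blocked, the progress figure freezes at 90% and the watchdog aborts the migration after the stall limit, exactly as in the WARNING line above.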
Retaking the bug to address the migration not completing.
Libvirt produces "cannot acquire state change lock" when it cannot start a QEMU job, either because one is currently active and nesting is not allowed, or because the nesting limit has been reached (in the latter case the log message is slightly different). Given the sequence of events, the blocked NFS domain, and the error reported, the most likely cause is the QEMU process stuck in D state on I/O. In turn, a likely trigger is the virDomainGetBlockJobInfo call VDSM makes for the high water mark during sampling, because the disk is configured as follows:

<disk device="disk" snapshot="no" type="file">
  <address domain="0x0000" function="0x0" slot="0x05" type="pci" bus="0x00"/>
  <source file="/rhev/data-center/5849b030-626e-47cb-ad90-3ce782d831b3/f15b179b-bbca-4d19-bc1b-fccb2e2fad95/images/3afcd443-c416-4897-b866-6e618df20302/194578cf-affa-47ff-afb7-01977b3d87b6"/>
  <target bus="virtio" dev="vda"/>
  <serial>3afcd443-c416-4897-b866-6e618df20302</serial>
  <boot order="1"/>
  <driver cache="none" error_policy="stop" io="threads" name="qemu" type="qcow2"/>
</disk>

(type="qcow2") This is enough to explain the failed migration.
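A minimal model of this failure mode, assuming the single-job serialization described above (illustrative Python, not libvirt or VDSM code; all names are hypothetical):

```python
import threading
import time

# One lock stands in for libvirt's per-domain job lock: only one QEMU
# monitor job at a time. A stats job stuck on dead storage starves the
# migration job, which then fails with the state-change-lock timeout.
job_lock = threading.Lock()

def run_job(work, timeout=0.2):
    """Try to start a 'job'; fail fast if another job holds the lock."""
    if not job_lock.acquire(timeout=timeout):
        raise RuntimeError("Timed out during operation: "
                           "cannot acquire state change lock")
    try:
        return work()
    finally:
        job_lock.release()

started = threading.Event()

def stuck_stats_call():
    # Stands in for a block-stats query whose qemu process is in D state.
    started.set()
    time.sleep(1.0)

blocker = threading.Thread(target=run_job, args=(stuck_stats_call,))
blocker.start()
started.wait()                  # make sure the stats job holds the lock
try:
    run_job(lambda: "migrate")  # the migration job cannot get the lock
    failed = False
except RuntimeError:
    failed = True
blocker.join()
```

The point of the model: it is not the migration itself that is broken, but the earlier job pinned on unresponsive storage, which every subsequent job (including migration control calls) times out behind.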
Migration aside, I think the question still remains - why does the autorecovery try to activate the host when the connection is still severed? I thought this was not expected.
(In reply to Michal Skrivanek from comment #14)
> Migration aside, I think the question still remains - why does the
> autorecovery try to activate the host when the connection is still
> severed? I thought this was not expected.

How is that relevant to the live migration? Even if we don't try to autorecover every 5 minutes, the libvirt process is still stuck on "cannot acquire state change lock" and the migration won't finish.
(In reply to Federico Simoncelli from comment #15)

It is not relevant to the migration, but that's not the point of the bug. It's about the host state. See the original description: "its status is changing from 'Non-Operational' to 'Up' and back all the time".
(In reply to Michal Skrivanek from comment #16)
> It is not relevant to the migration, but that's not the point of the bug.
> It's about the host state. See the original description: "its status is
> changing from 'Non-Operational' to 'Up' and back all the time".

The original description starts with: "I had a VM running on a host and I disabled this host's access to the NFS storage with iptables. Migration of the VM was then started, but it never finishes."

Now, I don't see how the host status changing from 'Non-Operational' to 'Up' (and back) affects the situation. You are unable to manage the host (e.g. move it to maintenance) because of the running VM, not because of the status. We could spin off another bug and just keep the status at 'Non-Operational', but I don't see the gain: you still can't migrate the VM and you still can't manage the host.
It's not affecting the migration indeed; my whole point since comment #10 is the flapping host status. I'll change the title to make it clear and let's call it a new bug then :) The code may have changed since the bug is almost a year old, but I'd still suggest a review of the domain autorecovery.
(In reply to Michal Skrivanek from comment #18)
> It's not affecting the migration indeed; my whole point since comment #10
> is the flapping host status. I'll change the title to make it clear and
> let's call it a new bug then :)

I think you missed the point. Now we have a new irrelevant bug and we lost track of an important one. How do you propose to track the migration issue?
There is no migration issue here. The fact that it didn't finish is expected. The fact that it took "forever" has been addressed by patches during 3.3.z and 3.4 (e.g. bug 1028917). The only thing I want to keep track of is this "other" issue with the host status.
According to comment 20 the migration issue is tracked (and solved) in a different bug, so this bug should only cover the host state issue. Moving to storage to have a look; there is nothing else related to virt.
I wonder if this still happens in 3.6.0. Aharon, do you have such a test scenario (blocking NFS storage from a non-SPM host)?
We have such a case, but I'm not sure it is on an HSM host. We can retest it.
Can you please retest?
Elad, Please do.
Tested using:
rhevm-3.6.0.2-0.1.el6.noarch
vdsm-4.17.8-1.el7ev.noarch

- SPM host configured with power management; the second host without it.
- Started a VM on the non-SPM host with its disk located on a non-master (NFS) domain. Initiated OS installation on the VM so that it would write to the disk.
- Blocked connectivity between the non-SPM host (where the qemu process is running) and the storage domain where the VM disk is located.

2015-11-02 09:06:33,380 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-7-thread-46) [] Storage domain '78a4a497-4dd7-4cf5-b7a8-264a8d467f05:exportDomain' is not visible to one or more hosts. Since the domain's type is 'ImportExport', hosts status will not be changed to non-operational
2015-11-02 09:06:34,558 INFO [org.ovirt.engine.core.vdsbroker.MigrateVDSCommand] (org.ovirt.thread.pool-7-thread-6) [5e622bee] START, MigrateVDSCommand( MigrateVDSCommandParameters:{runAsync='true', hostId='db6cd47a-68cb-4946-9622-02a073bb1ded', vmId='f8114073-a5ae-4df5-b263-837880160aa7', srcHost='aqua-vds4.qa.lab.tlv.redhat.com', dstVdsId='98f79a0b-d240-4ae5-9a89-4b4f76b7e5cb', dstHost='aqua-vds5.qa.lab.tlv.redhat.com:54321', migrationMethod='ONLINE', tunnelMigration='false', migrationDowntime='0', autoConverge='false', migrateCompressed='false', consoleAddress='null'}), log id: 6bcc93e0

Results:
- Host is flapping between 'connecting' and 'non-operational' (as it should).
- VM migration fails.

Logs from engine and hosts will be attached.
Created attachment 1088501 [details] logs-3.6
Target release should be set once a package build is known to fix an issue. Since this bug is not in MODIFIED, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.
(In reply to Elad from comment #28)
> Results:
> - Host is flapping between 'connecting' and 'non-operational' (as it should).
> - VM migration fails.

So, what's the AI here?
Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has already been released and the bug is not ON_QA.
As Elad's test results were as expected, it seems there is no issue here; closing. Please reopen if relevant.